Steph is currently out of the office, teaching people cool Data Science stuff on a cruise at Tech Outbound. She counts on her team to keep the company’s Twitter account afloat in the meantime, so I had to think of a way to contribute. What about advertising existing content from her blog in the style of her Twitter role model Mara Averick, i.e. an informative tweet accompanied by appealing screenshots?
In this blog post I’ll show how I can cover for Steph without even reading her blog posts, using R tools to summarize them! Read on if you want to know more about markovifyR, webshot and magick, Steph’s fantastic blog content, and my being lazy.
Can some NLP magic produce credible Stephspeak, or more precisely Stephtechspeak? My strategy here is to 1) create a model for generating credible Stephspeak and 2) use it to generate text based on the topic of each blog post. Steph is the best person to describe her own writing, and well, having tweets that sound like her will help her (data science) cruising go unnoticed.
I shall make use of the same method as Katie Jolly in her blog post generating Rupi Kaur-style poems and Julia Silge in her post about Stack Overflow super-contributor Jon Skeet: fitting a Markov chain model to existing, authentic text, and then using that model to generate new text.
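Before reaching for a package, here is the idea in miniature on a toy corpus. This is a sketch of the general technique only, not of markovify’s internals: to generate a sentence, repeatedly sample the next word among the words that followed the current word in the corpus.

```r
corpus <- c("data science is fun", "data science is hard")

# collect (word, next word) pairs within each sentence
pairs <- do.call(rbind, lapply(strsplit(corpus, " "), function(w) {
  data.frame(from = head(w, -1), to = tail(w, -1))
}))

# walk the chain from a start word until a word has no successor
set.seed(1)
word <- "data"
sentence <- word
repeat {
  nxt <- pairs$to[pairs$from == word]
  if (length(nxt) == 0) break
  word <- sample(nxt, 1)
  sentence <- c(sentence, word)
}
paste(sentence, collapse = " ")
# returns either "data science is fun" or "data science is hard"
```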
In their posts, Katie and Julia used different packages, so I wondered which one to choose. Alex Bresler’s markovifyR, a wrapper for the Python library markovify, won. Note that it depends on Python being installed on your system, which didn’t worry me since my laptop has met that requirement ever since I started playing with the cleanNLP package.
I decided to use Steph’s very own tweets as a corpus, instead of also using Locke Data’s tweets. I couldn’t rely on rtweet::get_timeline because the Twitter API only returns up to 3,200 tweets, which is far less than Steph’s whole production, so I asked her to request her Twitter archive and send it to me.
```r
steph_tweets <- readr::read_csv("data/theStephLocke_tweets.csv")
```
I cleaned the tweets a bit before fitting the model: keeping only original tweets, anonymizing mentions, and removing links.

```r
library("magrittr")

# keep original tweets only, anonymize mentions, strip links
steph_text <- steph_tweets %>%
  dplyr::filter(is.na(retweeted_status_id)) %>%
  dplyr::pull(text) %>%
  stringr::str_replace_all("@[a-zA-Z0-9_]*", "someone") %>%
  stringr::str_replace_all("https://t.co/[a-zA-Z0-9]*", "") %>%
  stringr::str_replace_all("http://t.co/[a-zA-Z0-9]*", "") %>%
  trimws()

markov_model <- markovifyR::generate_markovify_model(steph_text)
```
I then wrote a small helper to make the model speak from a given start word.

```r
steph_speak <- function(start, markov_model, count, seed = 42){
  set.seed(seed)
  bot_tweets <- markovifyR::markovify_text(
    markov_model = markov_model,
    maximum_sentence_length = 200,
    start_words = start,
    output_column_name = 'stephbot_text',
    count = count,
    tries = 100,
    only_distinct = TRUE,
    return_message = TRUE)
  bot_tweets$stephbot_text
}
```
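For instance, asking Stephbot for three sentences starting with “Check” (the call is reconstructed from the output shown below):

```r
# reconstructed call; the output suggests start word "Check" and count 3
steph_speak(start = "Check", markov_model = markov_model, count = 3)
```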
```
[1] "Check out the concept."
[2] "Check out Intro to R training environment"
[3] "Check out this form!"
```
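And the same starting from “W00t”:

```r
# reconstructed call, mirroring the "Check" example above
steph_speak(start = "W00t", markov_model = markov_model, count = 3)
```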
```
[1] "W00t I have time for #satRdays already!"
[2] "W00t someone talking #azurefunctions & loved the framework, scalability, & cost 2."
[3] "W00t someone talking #AI"
```
Now, to make Stephbot tweet about a given post, I need the posts’ metadata, in particular their tags.

```r
blog_info <- readr::read_csv("data/all_info_about_posts.csv")
blog_info <- tidyr::gather(blog_info, "tag", "value",
                           11:ncol(blog_info), na.rm = TRUE)

# remove the mark of category from tag names
blog_info <- dplyr::mutate(blog_info,
                           tag = stringr::str_replace(tag, "cat_", ""))

# remove tags that aren't real topics
blog_info <- dplyr::filter(blog_info, !tag %in% c("statuspost", "base"))
```
To pick a start word matching a post’s topic, I compare its tags to the possible start words of the Markov model, using fuzzy matching.

```r
library("tidytext")
data("stop_words")

# get all possible start words
all_possible_starts <- markovifyR::generate_start_words(markov_model)
all_possible_starts <- dplyr::select(all_possible_starts, wordStart)

# for matching we'll use a version of these words
# that's all lower case, and without hashtags
all_possible_starts <- dplyr::mutate(all_possible_starts,
                                     # escape encoding
                                     word = encodeString(wordStart),
                                     # lower case
                                     word = tolower(word)) %>%
  # drop hashtags
  dplyr::filter(!stringr::str_detect(word, "#"))

get_corresponding_start <- function(tags, all_possible_starts){
  matches <- tibble::tibble(tag = tags) %>%
    tidytext::unnest_tokens(word, tag, token = "words") %>%
    # remove stop words
    dplyr::anti_join(stop_words, by = "word") %>%
    # find close words
    fuzzyjoin::stringdist_left_join(all_possible_starts,
                                    by = "word",
                                    max_dist = 1,
                                    distance_col = "dist") %>%
    dplyr::filter(!is.na(wordStart)) %>%
    dplyr::filter(dist == min(dist))

  if (nrow(matches) == 0){
    "w00t"
  } else {
    set.seed(42)
    sample(matches$wordStart, 1)
  }
}

get_corresponding_start("data", all_possible_starts)
```
```
[1] "w00t"
```

Funnily enough, “data” has no close-enough match among the start words, so we get the “w00t” fallback.
Now I can glue everything together: pick a start word from a post’s tags, make Stephbot speak, and append the post’s URL.

```r
steph_describe_post <- function(post_df, markov_model, all_possible_starts){
  tags <- post_df$tag
  start <- get_corresponding_start(tags, all_possible_starts)
  set.seed(42)
  text <- steph_speak(start = start,
                      markov_model = markov_model,
                      count = 1,
                      seed = sample(1:100, 1))
  tweet_text <- paste(text, post_df$url[1])
  tweet_text
}
```
Let’s try this on actual posts.

```r
all_posts <- split(blog_info, blog_info$url)
all_posts[[1]]$tag
```
```
[1] "Microsoft Data Platform" "Community"
[3] "pass"                    "elections"
```
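Time to let Stephbot describe it (the call is reconstructed from the output below, assuming all_posts[[1]] is the PASS post whose tags we just saw):

```r
# reconstructed call; assumes all_posts[[1]] is the PASS post shown above
steph_describe_post(all_posts[[1]], markov_model, all_possible_starts)
```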
```
[1] "PASS is gr8 for starting up in arms because it looks like a snow tan right? https://itsalocke.com/blog/2016-pass-board-of-directors-candidate-town-halls/"
```
Another example, this time with the satRday post and its tags:

```
[1] "user group"  "Community"   "r"           "R"           "speaking"
[6] "conferences" "satrday"
```
And Stephbot’s tweet about it:

```
[1] "satRday location voting now open - be part of the tweets I used to send, follow someone as a community #rstats workshop with someone & someone for some #rstats training in Excel/SQL/R/Python - this time - trying to limit thruput to increase ratio every yr & struggle a lot - it's kind of thing, dunno bout skype https://itsalocke.com/blog/satrday-location-voting-now-open/"
```
For an even lazier option, the praise package can compliment a post without any Markov magic:

```r
pr_describe_post <- function(post_df){
  tags <- toString(post_df$tag)
  url <- post_df$url[1]
  praise::praise(template = paste0("What a ${adjective} blog post about ",
                                   tags, "! ", url))
}

pr_describe_post(all_posts[[1]])
```
```
[1] "What a premium blog post about user group, Community, r, R, speaking, conferences, satrday! https://itsalocke.com/blog/satrday-location-voting-now-open/"
```
Now for the screenshots! The shot_region function captures either the top of a post or one of its sections with webshot, then uses magick to add a border.

```r
shot_region <- function(df){
  header <- df$header
  number <- df$number
  url <- df$url
  post_name <- df$post_name

  if (header == ""){
    # no header: capture the top of the post, title included
    filename <- paste0("screenshot_tests/", post_name, number, "-title", ".png")
    webshot::webshot(url = url, cliprect = c(0, 0, 992, 200))
  } else {
    # capture the region around one h2 header
    filename <- paste0("screenshot_tests/", post_name, number, "-", header, ".png")
    webshot::webshot(url = url,
                     selector = paste0("#", header),
                     expand = c(5, 5, 200, 5))
  }

  # add a border and save under the final name
  magick::image_read("webshot.png") %>%
    magick::image_border(color = "#E8830C", geometry = "10x10") %>%
    magick::image_border(color = "#2165B6", geometry = "10x10") %>%
    magick::image_write(filename)

  file.remove("webshot.png")
}
```
The get_post_info function gathers, for one post, all the regions to capture: the title area, plus each h2 section.

```r
get_post_info <- function(post_df){
  url <- post_df$url[1]
  post_name <- stringr::str_replace(post_df$name[1], "\\.md", "")

  # scrape the attributes (ids) of the post's h2 headers
  headers <- httr::GET(url) %>%
    httr::content() %>%
    rvest::html_nodes("h2") %>%
    rvest::html_attrs() %>%
    unlist()

  tibble::tibble(url = url,
                 post_name = post_name,
                 header = c("", headers),
                 number = seq_along(header))
}

get_post_info(all_posts[[1]])
```
Now we can apply webshooting to get all the screenshots for a post, and a bit of magick to make them prettier. Here is how one would use these functions for one post.
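A minimal sketch of that usage, assuming shot_region is meant to be applied to each row of get_post_info’s output (the original snippet was lost, so treat this as one plausible way to wire them together):

```r
# one screenshot per region: title area, then each h2 section
post_info <- get_post_info(all_posts[[1]])
purrr::walk(split(post_info, post_info$number), shot_region)
```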
What would one tweet about this blog post by the way?
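The text below came out of steph_describe_post. Since I won’t vouch for the post’s numeric index in all_posts, here is a hypothetical call that looks the post up by its URL instead:

```r
# hypothetical lookup; the original call may well have used a numeric index
docker_post <- all_posts[[grep("talking-data-and-docker", names(all_posts))]]
steph_describe_post(docker_post, markov_model, all_possible_starts)
```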
```
[1] "Agile is great for management looking to get back to Boston this year! https://itsalocke.com/blog/talking-data-and-docker/"
```

And the praise version:

```r
pr_describe_post(all_posts[[101]])
```
Here is the result!
Apart from the font issue, it’s not perfect: instead of using a fixed size for the region under the title or header, one should make the clipping depend on the length of the content inside each section.
So here it is: we can sort of maraaverickfy a blog post without reading it! The end result is not nearly as good as Mara’s actual tweets, but hey, reading blog posts demands time! And brain power! Still, the most obvious improvement would be to really read Steph’s posts, since they’re good ones.
Now if I were to continue with my lazy approach, I’d first try to fix the fonts on the screenshots. I’d also like to add chibis to the screenshots, and maybe to use rtweet to build a real bot sending the tweets, or a Shiny app combining both the text- and screenshot-generating functions with the tags, in order to plan regular tweets about all the different (evergreen) topics covered on the blog. I also think the text-generating aspect could be explored further. That way, maybe Steph could spend most of her time on a cruise without anyone noticing her Twitter absence.