A recent project of mine has been setting up a Twitter bot on innovation quotes. I enjoy this project because in addition to curating a great set of content and growing an audience around it, I have also learned a lot about coding.
From web scraping to regular expressions to social media automation, I’ve learned a lot collecting a list of over 30,000 quotes related to innovation.
Lately I’ve been turning my attention to finding quotes about computer programming, as digital-savvy is crucial to innovation today. These exercises prove great blog post material and quite “meta,” too… writing code to read quotes about writing code. I will cover one of what I hope to make a series below. For this example…
Scraping DevTopics.com’s “101 Great Computer Programming Quotes”
This is a nice set of quotes but we can’t quite copy-and-paste them into a .csv file as in doing so each quote is split across multiple rows and begins with its numeric position. I also want to eliminate the quotation marks and parentheses from these quotations as stylistically I tend to avoid them for Twitter.
While we might despair about the orderliness of this page based on this first attempt, make no mistake that there is well-reasoned logic running under the code with its HTML, and we will need to go there instead.
Part I: Scrape
To do this I will load up the rvest package for R and SelectorGadget extension for Chrome.
I want to identify the HTML nodes which hold the quotes we want, then collect that text. To do that, I will initialize the SelectorGadget, then hover and click on the first quote.
In the bottom toolbar we see the value is set as li
, a common HTML tag for items of a list.
Knowing this, we will use the html_nodes
function in R to parse those nodes, then html_text
to extract the text they hold.
Doing this will return a character vector, but I will convert it to a dataframe for ease of manipulation.
Our code thus far is below.
Part II: Clean
Gathering our quotes via rvest versus copying-and-pasting, we get one quote per line, making it more legible to store in our final workbook. We’ve also left the numerical position of each quote. But some issues with the text remain.
First off, looking through the gathered selection of text, I will see that not all text held in the li
node is a quote. This takes some manual intervention to spot, but here I will use dplyr’s slice
function to keep only rows 26 through 126 (corresponding to 100 quotes).
We still want to eliminate the parentheses and quotation markers, and to do this I will use regular expression functions from stringr to replace them.
a. Replace “(“, “)”, and ““” with “”
This is not meant as a comprehensive guide to the notorious regular expression, and if you are not familiar I suggest Chapter 14 of R for Data Science. So I assume some familiarity here as otherwise it becomes quite tedious.
Because “(” and “)” are both metacharacters we will need to escape them. Placing these three characters together with the “or” pipe ( | ) we then use the str_replace_all function to replace strings matching any of the three with nothing “”. |
b. Replace “”” with ” “
The end of a quotation is handled differently as we need a space between the quotation and the author; thus this expression is moved to its own function and we use str_replace
to replace matches with ” “.
Bonus: Set it up for social media
Because I intend to send these quotes to Twitter so I will put a couple finishing touches on here.
First, using the paste
function from base R, I will concatenate our quotes with a couple select hashtags.
Next, I use dplyr’s filter
function to exclude lines that are longer than 240 characters, using another stringr function, str_length
.
The quote for Part II is displayed below.
Finally, find the complete code below.
From web scraping to dataframe manipulation to regular expression, this exercise packs a punch in dealing with real-world unstructured text data — and it comes with some enjoyable reading, too.
I hope this post inspires you to tackle the world of text, and I plan to walk through a couple more of these.
Related