Note: This tutorial is available as a video series and a Jupyter notebook, and the dataset of lies is available as a CSV file.
Summary
This is an introductory tutorial on web scraping in Python. All that is required to follow along is a basic understanding of the Python programming language.
By the end of this tutorial, you will be able to scrape data from a static web page using the requests and Beautiful Soup libraries, and export that data into a structured text file using the pandas library.
Outline
- What is web scraping?
- Examining the New York Times article
- Examining the HTML
- Reading the web page into Python
- Recap: Beautiful Soup methods and attributes
- Building the dataset
- Applying a tabular data structure
What is web scraping?
On July 21, 2017, the New York Times updated an opinion article called Trump’s Lies, detailing every public lie the President has told since taking office. Because this is a newspaper, the information was (of course) published as a block of text. This is a great format for human consumption, but it can’t easily be understood by a computer. In this tutorial, we’ll extract the President’s lies from the New York Times article and store them in a structured dataset.
This is a common scenario: You find a web page that contains data you want to analyze, but it’s not presented in a format that you can easily download and read into your favorite data analysis tool. You might imagine manually copying and pasting the data into a spreadsheet, but in most cases, that is way too time consuming. A technique called web scraping is a useful way to automate this process.
What is web scraping? It’s the process of extracting information from a web page by taking advantage of patterns in the web page’s underlying code. Let’s start looking for these patterns!
Examining the New York Times article
Here’s the way the article presented the information:
When converting this into a dataset, you can think of each lie as a “record” with four fields:
- The date of the lie.
- The lie itself (as a quotation).
- The writer’s brief explanation of why it was a lie.
- The URL of an article that substantiates the claim that it was a lie.
Importantly, those fields have different formatting, which is consistent throughout the article: the date is bold red text, the lie is “regular” text, the explanation is gray italics text, and the URL is linked from the gray italics text.
Why does the formatting matter? Because it’s very likely that the code underlying the web page “tags” those fields differently, and we can take advantage of that pattern when scraping the page. Let’s take a look at the source code, known as HTML:
Examining the HTML
To view the HTML code that generates a web page, you right click on it and select “View Page Source” in Chrome or Firefox, “View Source” in Internet Explorer, or “Show Page Source” in Safari. (If that option doesn’t appear in Safari, just open Safari Preferences, select the Advanced tab, and check “Show Develop menu in menu bar”.)
If you view the source of the New York Times article, you will see the full HTML of the page (we’ll fetch those same lines with Python shortly). Let’s locate the first lie by searching the HTML for the text “iraq”:
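Based on the patterns we’ll rely on below, the markup for that first record looks roughly like this simplified sketch (the real href, replaced here with a placeholder, points to an article substantiating the lie):

<span class="short-desc"><strong>Jan. 21&nbsp;</strong>“I wasn’t a fan of Iraq. I didn’t want to go into Iraq.”&nbsp;<a href="http://example.com/placeholder">(He was for an invasion before he was against it.)</a></span>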
Thankfully, you only have to understand three basic facts about HTML in order to get started with web scraping!
Fact 1: HTML consists of tags
You can see that the HTML contains the article text, along with “tags” (specified using angle brackets) that “mark up” the text. (“HTML” stands for Hyper Text Markup Language.)
For example, one tag is <strong>, which means “use bold formatting”. There is a <strong> tag before “Jan. 21” and a </strong> tag after it. The first is an “opening tag” and the second is a “closing tag” (denoted by the /), which indicates to the web browser where to start and stop applying the formatting. In other words, this tag tells the web browser to make the text “Jan. 21” bold. (Don’t worry about the &nbsp; after “Jan. 21” - we’ll deal with that later.)
Fact 2: Tags can have attributes
HTML tags can have “attributes”, which are specified in the opening tag. For example, <span class="short-desc"> indicates that this particular <span> tag has a class attribute with a value of short-desc.
For the purpose of web scraping, you don’t actually need to understand the meaning of <span>, class, or short-desc. Instead, you just need to recognize that tags can have attributes, and that they are specified in this particular way.
Fact 3: Tags can be nested
Let’s pretend my HTML code said:
Hello <strong><em>Data School</em> students</strong>
The text Data School students would be bold, because all of that text is between the opening <strong> tag and the closing </strong> tag. The text Data School would also be in italics, because the <em> tag means “use italics”. The text “Hello” would not be bold or italics, because it’s not within either the <strong> or <em> tags. Thus, it would appear as follows:
Hello Data School students
(That is, “Data School students” in bold, with “Data School” also in italics.)
The central point to take away from this example is that tags “mark up” text from wherever they open to wherever they close, regardless of whether they are nested within other tags.
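As a quick preview of how a parser sees this nesting, here’s a minimal sketch using the Beautiful Soup library (which we’ll properly introduce below):

from bs4 import BeautifulSoup

# parse the example snippet from above
snippet = 'Hello <strong><em>Data School</em> students</strong>'
demo = BeautifulSoup(snippet, 'html.parser')
print(demo.strong.text)  # Data School students (everything the <strong> tag marks up)
print(demo.em.text)      # Data School (everything the <em> tag marks up)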
Got it? You now know enough about HTML in order to start web scraping!
Reading the web page into Python
The first thing we need to do is to read the HTML for this article into Python, which we’ll do using the requests library. (If you don’t have it, you can pip install requests from the command line.)
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')
The code above fetches our web page from the URL, and stores the result in a “response” object called r. That response object has a text attribute, which contains the same HTML code we saw when viewing the source from our web browser:
print(r.text[0:500])  # print the first 500 characters
<!DOCTYPE html>
<html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/">
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page
Next, we’ll parse the HTML using the Beautiful Soup library, which converts it into a “soup” object that is easy to search:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
Since each record is wrapped in a <span> tag with a class attribute of short-desc, we can collect all of the records using the find_all() method:
results = soup.find_all('span', attrs={'class':'short-desc'})
This returns a ResultSet object, which acts like a list, so we can check its length:
len(results)
116
There are 116 results, which seems reasonable given the length of the article. (If this number did not seem reasonable, we would examine the HTML further to determine if our assumptions about the patterns in the HTML were incorrect.)
We can also slice the object like a list, in order to examine the first three results:
results[0:3]
[Jan. 21 “I wasn’t a fan of Iraq. I didn’t want to go into Iraq.” (He was for an invasion before he was against it.),
 Jan. 21 “A reporter for Time magazine — and I have been on their cover 14 or 15 times. I think we have the all-time record in the history of Time magazine.” (Trump was on the cover 11 times and Nixon appeared 55 times.),
 Jan. 23 “Between 3 million and 5 million illegal votes caused me to lose the popular vote.” (There’s no evidence of illegal voting.)]
And we can use negative indexing to examine the last result:
results[-1]
Looks good!
We have now collected all 116 of the records, but we still need to separate each record into its four components (date, lie, explanation, and URL) in order to give the dataset some structure.
Web scraping is often an iterative process, in which you experiment with your code until it works exactly as you desire. To simplify the experimentation, we’ll start by only working with the first record in the results object, and then later on we’ll modify our code to use a loop:
first_result = results[0]
first_result
Jan. 21 “I wasn’t a fan of Iraq. I didn’t want to go into Iraq.” (He was for an invasion before he was against it.)
Let’s start by extracting the date. Since we know that dates are formatted as bold text, we’ll search first_result for the <strong> tag:
first_result.find('strong')
<strong>Jan. 21 </strong>
This code searches first_result for the first instance of a <strong> tag, and again returns a Beautiful Soup “Tag” object (not a string). Since we want to extract the text between the opening and closing tags, we can access its text attribute, which does in fact return a regular Python string:
first_result.find('strong').text
'Jan. 21\xa0'
The \xa0 at the end is how Python displays the HTML character entity &nbsp; (a non-breaking space). We’ll slice it off by keeping everything except the last character:
first_result.find('strong').text[0:-1]
'Jan. 21'
Finally, we’re going to add the year, since we don’t want our dataset to include ambiguous dates:
first_result.find('strong').text[0:-1] + ', 2017'
'Jan. 21, 2017'
Next, let’s extract the lie. Let’s take another look at first_result:
first_result
Our goal is to extract the two sentences about Iraq. Unfortunately, there isn’t a pair of opening and closing tags that starts immediately before the lie and ends immediately after the lie. Therefore, we’re going to have to use a different technique:
first_result.contents
[Jan. 21 , '“I wasn’t a fan of Iraq. I didn’t want to go into Iraq.”\xa0', (He was for an invasion before he was against it.)]
The contents attribute returns a list of the Tag’s children. The lie is the second element of that list, so we’ll select it by its index:
first_result.contents[1]
'“I wasn’t a fan of Iraq. I didn’t want to go into Iraq.”\xa0'
Finally, we’ll slice off the curly quotation marks as well as the extra space at the end:
first_result.contents[1][1:-2]
'I wasn’t a fan of Iraq. I didn’t want to go into Iraq.'
Next, we want to extract the explanation of why it was a lie. One option is to select the third element of the contents list:
first_result.contents[2]
(He was for an invasion before he was against it.)
The second option is to search for the surrounding tag, like we did when extracting the date:
first_result.find('a')
(He was for an invasion before he was against it.)
Either way, we can access the text attribute, and then slice off the opening and closing parentheses:
first_result.find('a').text[1:-1]
'He was for an invasion before he was against it.'
Finally, we want to extract the URL of the article that substantiates the writer’s claim that the President was lying.
Let’s examine the <a> tag within first_result:
first_result.find('a')
(He was for an invasion before he was against it.)
The URL is stored in the tag’s href attribute, which we can access by treating the Tag like a Python dictionary:
first_result.find('a')['href']
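This dictionary-style access works for any attribute of a Tag. Here’s a self-contained sketch (using a made-up snippet, not the article’s HTML):

from bs4 import BeautifulSoup

# hypothetical tag, just to illustrate dictionary-style attribute access
tag = BeautifulSoup('<a href="http://example.com" target="_blank">link</a>', 'html.parser').a
print(tag['href'])    # http://example.com
print(tag['target'])  # _blank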
Recap: Beautiful Soup methods and attributes
Before we finish building the dataset, I want to summarize a few ways you can interact with Beautiful Soup objects.
You can apply these two methods to either the initial soup object or a Tag object (such as first_result):
- find(): searches for the first matching tag, and returns a Tag object
- find_all(): searches for all matching tags, and returns a ResultSet object (which you can treat like a list of Tags)
You can extract information from a Tag object (such as first_result) using these two attributes:
- text: extracts the text of a Tag, and returns a string
- contents: extracts the children of a Tag, and returns a list of Tags and strings
It’s important to keep track of whether you are interacting with a Tag, ResultSet, list, or string, because that affects which methods and attributes you can access.
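One handy way to keep track is to check the type of each object as you experiment. Here’s a quick sketch, reusing the soup and first_result objects from above:

print(type(soup.find('span')))       # <class 'bs4.element.Tag'>
print(type(soup.find_all('span')))   # <class 'bs4.element.ResultSet'>
print(type(first_result.text))       # <class 'str'>
print(type(first_result.contents))   # <class 'list'>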
And of course, there are many more methods and attributes available to you, which are described in the Beautiful Soup documentation.
Building the dataset
Now that we’ve figured out how to extract the four components of first_result, we can create a loop to repeat this process on all 116 results. We’ll store the output in a list of tuples called records:
records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))
Since there were 116 results, we should have 116 records:
len(records)
116
Let’s do a quick spot check of the first three records:
records[0:3]
Looks good!
Applying a tabular data structure
The last major step in this process is to apply a tabular data structure to our existing structure (which is a list of tuples). We’re going to do this using the pandas library, an incredibly popular Python library for data analysis and manipulation. (If you don’t have it, here are the installation instructions.)
The primary data structure in pandas is the “DataFrame”, which is suitable for tabular data with columns of different types, similar to an Excel spreadsheet or SQL table. We can convert our list of tuples into a DataFrame by passing it to the DataFrame constructor and specifying the desired column names:
import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
The DataFrame includes a head() method, which allows you to examine the top of the DataFrame:
df.head()
The numbers on the left side of the DataFrame are known as the “index”, which act as identifiers for the rows. Because we didn’t specify an index, it was automatically assigned as the integers 0 to 115.
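You can confirm this by inspecting the DataFrame’s index attribute directly:

df.index  # RangeIndex(start=0, stop=116, step=1)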
We can examine the bottom of the DataFrame using the tail() method:
df.tail()
Did you notice that “January” is abbreviated, while “July” is not? It’s best to format your data consistently, and so we’re going to convert the date column to pandas’ special “datetime” format:
df['date'] = pd.to_datetime(df['date'])
The code above converts the “date” column to datetime format, and then overwrites the existing “date” column. (Notice that we did not have to tell pandas that the column was originally in “MONTH DAY, YEAR” format - pandas just figured it out!)
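You can see this inference in action by passing a single string to pd.to_datetime() (a quick sketch):

pd.to_datetime('Jan. 21, 2017')  # Timestamp('2017-01-21 00:00:00')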
Let’s take a look at the results:
df.tail()
Finally, let’s export the dataset to a CSV file using the to_csv() method, passing index=False since we don’t need pandas to save the index:
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')
In the future, you can rebuild this DataFrame by reading the CSV file back into pandas with the read_csv() function:
df = pd.read_csv('trump_lies.csv', parse_dates=['date'], encoding='utf-8')
Summary: 16 lines of Python code
Here is all of the code that we wrote in this tutorial:
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs={'class':'short-desc'})

records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'
    lie = result.contents[1][1:-2]
    explanation = result.find('a').text[1:-1]
    url = result.find('a')['href']
    records.append((date, lie, explanation, url))

import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')
As a bonus, here are two equivalent ways to search for a tag within first_result:
# search for a tag by name
first_result.find('strong')

# shorter alternative: access it like an attribute
first_result.strong
You can also search for multiple tags a few different ways:
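For example, here’s a sketch of three equivalent ways to find the records from earlier (all standard Beautiful Soup syntax):

# these three lines all find the same set of tags
results = soup.find_all('span', attrs={'class': 'short-desc'})
results = soup.find_all('span', class_='short-desc')  # class_ avoids Python's reserved word 'class'
results = soup('span', class_='short-desc')           # calling the soup object is shorthand for find_all()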
For more details, check out the Beautiful Soup documentation.
P.S. Want to be the first to know when I release new Python tutorials? Subscribe to the Data School newsletter.