Introduction
This tutorial walks you through how to scrape data from a table on Wikipedia. The page we will be scraping is List of countries and dependencies by population. You can visit the link to get a feel for how the page looks. The table with the data to be scraped is shown below.
Packages used
csv - A module in Python’s standard library for reading and writing data in CSV format to and from a file object.
Beautiful Soup - A library for pulling data out of HTML and XML files.
Requests - A library for making HTTP requests in Python.
Setup
This tutorial assumes you already have Python 3 installed on your machine. If not, follow here to install Python 3 on OS X. Then install the Beautiful Soup and Requests libraries with pip; they are published on PyPI as beautifulsoup4 and requests.
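To confirm that the setup worked, a quick check like the one below can help. It is only a sketch: it assumes the two libraries were installed from PyPI (for example with pip install beautifulsoup4 requests), and the helper name missing_packages is made up for this example. Note that Beautiful Soup is imported under the module name bs4.

```python
import importlib.util


def missing_packages(modules=("bs4", "requests")):
    """Return the module names from `modules` that cannot be imported."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("Beautiful Soup and Requests are ready to use.")
```

An empty list means both libraries can be imported by the Python interpreter that ran the check.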
Scrape the data
Navigate to a directory of your choice on your machine and create a new file named main.py (on OS X or Linux, for example, with the command touch main.py).
In main.py, add the following code:
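What follows is a minimal sketch of the scraping function, reconstructed from the line-by-line walkthrough and laid out so that lines 1 to 22 match the line numbers cited there. Several details are assumptions rather than the author's exact code: the 10-second timeout, the html.parser backend, the newline and encoding arguments to open, and reading the data cells from td elements. The entry point described for lines 25 to 27 is left out of this sketch so that loading it has no side effects.

```python
import csv
import requests
from bs4 import BeautifulSoup


def scrape_data(url):
    # Assumed timeout of 10 seconds; any sensible value works
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')
    # The second table in the document (index 1) holds the data
    table = soup.find_all('table')[1]
    # Table rows that are direct children of a tbody element
    rows = table.select('tbody > tr')
    # Header text from the first row, trailing whitespace removed
    header = [th.text.rstrip() for th in rows[0].find_all('th')]
    # newline='' and utf-8 are assumptions; header first, then data rows
    with open('output.csv', 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        for row in rows[1:]:
            data = [td.text.rstrip() for td in row.find_all('td')]
            writer.writerow(data)
```

If the page layout changes, the table index on line 11 is the first thing to check, since it depends on how many tables come before the one you want.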
Let us break this apart and see how it works line by line.
Lines 1 - 3 Import all the packages needed to run the application.
Line 6 We define the function scrape_data, which takes a url parameter.
Line 8 We make a GET request to the url using the get method of the requests library.
When making HTTP requests with the requests library, it is important to set a timeout in case the server does not respond in a timely manner, because requests does not apply one by default. This prevents your program from hanging indefinitely while waiting for a response from the server.
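The advice about timeouts can be sketched as follows. The helper name fetch_html and the 10-second default are assumptions made for this example, and the raise_for_status call is an optional extra that the walkthrough does not mention.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the server is too slow."""
    try:
        # timeout is in seconds; it limits how long to wait for the
        # connection and for the server to send data
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Covers both connection timeouts and read timeouts
        return None
    # Raise an exception for 4xx/5xx responses so that an error page
    # is not mistaken for the data we want
    response.raise_for_status()
    return response.content
```

Returning None on a timeout is one possible design; letting the exception propagate is just as reasonable when the caller wants to decide how to retry.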
Line 9 We create a Beautiful Soup tree structure from the content of the server’s response. This object is easy to navigate and search through.
Line 11 We search through the Beautiful Soup object soup, using its find_all method, to find the second table in the document, which contains the data we want. The find_all method searches the tree structure for all HTML tags that match the given filter.
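To see how find_all behaves without touching the network, here is a small sketch that runs against a made-up HTML string. The markup and the id values are invented for illustration and are not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a page that contains two tables
html = """
<table id="first"><tr><td>navigation</td></tr></table>
<table id="second"><tr><td>population data</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, in document order
tables = soup.find_all("table")
print(len(tables))  # 2

# Indexing starts at 0, so index 1 is the second table
table = tables[1]
print(table["id"])  # second
```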
Line 13 This line of code selects all the tr elements whose parent is a tbody element within the table. tr elements represent the table rows.
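The selector on that line can be tried on a small table, as in the sketch below. The markup and values are placeholders. One detail is worth knowing: Python's built-in html.parser does not add a tbody element when the source omits it, so this selector only matches when the tbody tag is actually present in the HTML.

```python
from bs4 import BeautifulSoup

# Made-up markup with an explicit tbody element
html = """
<table>
  <tbody>
    <tr><th>Country</th><th>Population</th></tr>
    <tr><td>A</td><td>100</td></tr>
    <tr><td>B</td><td>200</td></tr>
  </tbody>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

# CSS child combinator: tr elements whose direct parent is a tbody
rows = table.select("tbody > tr")
print(len(rows))  # 3
```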
Line 15 The first row usually contains the header cells. We search through the first row in the rows list to get the text values of all th elements in that row. We also remove all trailing whitespace from the text using Python’s rstrip string method.
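A short sketch of the header step, using a made-up header row in which each cell's text ends with a newline:

```python
from bs4 import BeautifulSoup

# Made-up header row; the text of each cell ends with a newline
html = "<table><tbody><tr><th>Rank\n</th><th>Country\n</th></tr></tbody></table>"

rows = BeautifulSoup(html, "html.parser").select("tbody > tr")

# Text of each th cell in the first row, with trailing whitespace removed
header = [th.text.rstrip() for th in rows[0].find_all("th")]
print(header)  # ['Rank', 'Country']
```

rstrip only trims the right-hand side of the string; strip would remove leading whitespace as well.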
Lines 17 - 22 This opens a file and creates a new file object. The w mode is used to ensure the file is open for writing. First we write the header row, then loop through the remaining rows (skipping the first) and write the data from each of them to the file object.
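The writing step can be sketched on its own with the csv module. The function name write_table is hypothetical, and the newline and encoding arguments are additions recommended for CSV files rather than something the walkthrough specifies.

```python
import csv


def write_table(path, header, data_rows):
    """Write one header row followed by the data rows to a CSV file."""
    # 'w' creates the file, or truncates it if it already exists;
    # newline='' lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        for row in data_rows:
            writer.writerow(row)
```

Calling write_table("example.csv", ["Country", "Population"], [["A", "100"], ["B", "200"]]) would produce a three-line file; the values here are placeholders, not real figures.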
Lines 25 - 27 We check that the module is being run as the main program and call the function scrape_data with a specified url to scrape the data.
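The entry-point check looks roughly like the sketch below. Here scrape_data is only a placeholder so the snippet runs without network access, and the URL is an assumption based on the page named in the introduction.

```python
def scrape_data(url):
    # Placeholder body so this snippet runs without network access;
    # in main.py this is the real scraping function
    print(f"scraping {url}")


# True only when the file is run directly (for example, python main.py),
# not when it is imported from another module
if __name__ == "__main__":
    # Assumed address of the Wikipedia page named in the introduction
    url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    scrape_data(url)
```

Because of the check, importing main from another script gives access to scrape_data without starting a scrape.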
In a terminal, run main.py with Python (for example, python main.py, or python3 main.py if that is how Python 3 is invoked on your machine) to scrape the data.
An output file named output.csv containing the data should be produced in the root folder.
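One way to check the result is to read the first few rows back with the csv module. This is a sketch: the function name preview_csv and the limit of five rows are arbitrary choices, and it assumes the file is encoded as UTF-8.

```python
import csv
from itertools import islice


def preview_csv(path="output.csv", limit=5):
    """Return the first `limit` rows of a CSV file as lists of strings."""
    with open(path, newline="", encoding="utf-8") as csv_file:
        reader = csv.reader(csv_file)
        # islice stops after `limit` rows without reading the whole file
        return list(islice(reader, limit))
```

The first row returned should be the header, followed by the first data rows from the table.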
Conclusion
Before you begin scraping data from any website, study the HTML markup of the page to determine the location of the data you want. If you have any questions or comments, please add them to the comments section below.