How to scrape data from a website using Python

Introduction

This tutorial walks you through how to scrape data from a table on Wikipedia. The page we will be scraping is List of countries and dependencies by population. You can visit the link to get a feel for how the page looks. The table containing the data to be scraped is shown below.

Packages used

csv - A module in Python's standard library for reading and writing data to a file object in CSV format.

Beautiful Soup - A library for pulling data out of HTML and XML files.

Requests - A library for making HTTP requests in Python.

Setup

It is assumed that you already have Python 3 installed on your machine; if not, follow here to install Python 3 on macOS. Run the commands below to install the beautifulsoup4 and requests libraries.

pip install requests

pip install beautifulsoup4
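
To confirm that both libraries installed correctly, you can run a quick check like the one below from a Python shell; the exact version numbers will vary with your environment.

import requests
import bs4

# both imports succeeding (and printing a version) confirms the installation
print(requests.__version__)
print(bs4.__version__)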

Scrape the data

Navigate to a specific directory on your machine and run the command below to create a new file named main.py.

touch main.py

In main.py, add the following code:

import csv
import requests
from bs4 import BeautifulSoup


def scrape_data(url):

    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'html.parser')

    table = soup.find_all('table')[1]

    rows = table.select('tbody > tr')

    header = [th.text.rstrip() for th in rows[0].find_all('th')]

    with open('output.csv', 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(header)
        for row in rows[1:]:
            data = [td.text.rstrip() for td in row.find_all('td')]
            writer.writerow(data)


if __name__ == "__main__":
    url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    scrape_data(url)


Let us break this apart and see how it works line by line.

Lines 1 - 3 We import all the packages needed to run the application.

Line 6 We define the function scrape_data that takes a url parameter.

Line 8 We make a GET request to the URL using the get method of the requests library.

When making HTTP requests with the requests library, it is important to set a timeout in case the server does not respond in a timely manner. This prevents your program from hanging indefinitely while waiting for a response from the server.
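
As a rough sketch, the timeout can be paired with basic error handling using the exception classes the requests library provides (this reuses the url and requests names from main.py):

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
except requests.exceptions.Timeout:
    print("The server took too long to respond")
except requests.exceptions.RequestException as error:
    print(f"The request failed: {error}")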

Line 9 We create a BeautifulSoup tree structure from the content of the response from the server. This object is easy to navigate and search through.
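
For instance, once the soup object from main.py exists you can explore the tree interactively; the lines below are only an illustration of the kind of navigation BeautifulSoup supports:

print(soup.title.text)               # the text of the page's title element
first_heading = soup.find('h1')      # the first h1 element on the page
print(first_heading.text)
links = soup.find_all('a', limit=5)  # the first five anchor tags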

Line 11 We search through the BeautifulSoup object soup to find the second table in the document, which contains the data we want, using its find_all method. The find_all method searches for all HTML tags that match the filter/search term in the tree structure.
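
Since find_all returns a list, indexing picks out a specific match; it can also filter on attributes. The wikitable class name below is only an example of an attribute filter, not something this tutorial relies on:

tables = soup.find_all('table')                              # every table on the page
second_table = tables[1]                                     # the table used in this tutorial
styled_tables = soup.find_all('table', class_='wikitable')   # filter tables by class attribute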

Line 13 This line of code selects all the tr elements whose parent is a tbody element within the table. tr elements represent the table rows.
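
The select method accepts any CSS selector, so the same idea generalizes; a couple of variations on the table object, for illustration only:

rows = table.select('tbody > tr')            # tr elements that are direct children of tbody
cells = table.select('tr td')                # every td inside any tr
first_row = table.select_one('tbody > tr')   # only the first matching row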

Line 15 The first row usually contains the header cells. We search through the first row in the rows list to get the text values of all th elements in that row. We also remove any trailing whitespace from the text using the rstrip string method.
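
Header text scraped from HTML often carries a trailing newline, which is why rstrip is applied; the sample string below is made up for illustration, and rows refers to the list built on line 13:

raw = 'Population\n'
print(repr(raw.rstrip()))  # 'Population', with the trailing newline removed
header = [th.text.rstrip() for th in rows[0].find_all('th')]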

Lines 17 - 22 This opens a file and creates a new file object. The w mode ensures the file is open for writing, newline='' stops the csv module from inserting extra blank lines, and encoding='utf-8' keeps non-ASCII country names intact. First we write the header row, then we loop through the remaining rows (skipping the header row), collect the text of each td cell, and write that data to the file object.
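
In isolation, the writing step looks like this minimal sketch; the file name and rows here are made up purely for illustration:

import csv  # already imported at the top of main.py

with open('example.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Country', 'Population'])      # header row
    writer.writerow(['Exampleland', '1,000,000'])   # one data row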

Lines 25 - 27 We check that the module is being run as the main program and call the function scrape_data with the specified URL to scrape the data.

On the terminal, run the command below to scrape the data:

python main.py

An output file named output.csv containing the data should be produced in the same directory as main.py.
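
To double-check the result, you can read the file back with the same csv module; a quick preview of the first few rows might look like this:

import csv

with open('output.csv', newline='', encoding='utf-8') as csv_file:
    reader = csv.reader(csv_file)
    for row in list(reader)[:5]:  # print only the first five rows
        print(row)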

Conclusion

Before you begin scraping data from any website, take time to study the HTML markup/content of the site to determine where the data you want is located. If you have any questions or comments, please add them to the comments section below.