Previously, I’ve written about the basics of scraping and how you can find API calls in order to fetch data that isn’t easily downloadable.
For simplicity, the code in these posts has always been synchronous – given a list of URLs, we process one, then the next, then the next, and so on. While this makes for code that's straightforward, it can also be slow.
This doesn’t have to be the case though. Scraping is often an example of code that is embarrassingly parallel. With some slight changes, our tasks can be done asynchronously, allowing us to process more than one URL at a time.
In version 3.2, Python introduced the concurrent.futures module, which is a joy to use for parallelizing tasks like scraping. The rest of this post will show how we can use the module to make our previously synchronous code asynchronous.
Parallelizing your tasks
Imagine we have a list of several thousand URLs. In previous posts, we’ve always written something that looks like this:
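Something like this minimal sketch, where parse stands in for whatever fetching and extraction each page needs and the urls list is just a placeholder:

```python
import time

import requests

# Placeholder list – imagine several thousand of these.
urls = ["https://example.com/page/1", "https://example.com/page/2"]

def parse(url):
    """Fetch a single page and pull out whatever data we're after."""
    response = requests.get(url)
    response.raise_for_status()
    # ... extract what you need from response.text ...
    return response.text

results = []
for url in urls:
    results.append(parse(url))
    time.sleep(1)  # pause briefly between requests
```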
The above is an example of synchronous code – we’re looping through a list of URLs, processing one at a time. If the list of URLs is relatively small or we’re not concerned about execution time, there’s little reason to parallelize these tasks – we might as well keep things simple and wait it out.
However, sometimes we have a huge list of URLs – at least several thousand – and we can’t wait hours for them to finish.
With concurrent.futures, we can work on multiple URLs at once by adding a ProcessPoolExecutor and making a slight change to how we fetch our results.
But first, a reminder: if you're scraping, don't be a jerk. Space out your requests appropriately and don't hammer the site (i.e. use time.sleep to wait briefly between each request and set max_workers to a small number). Being a jerk runs the risk of getting your IP address blocked – good luck getting that data now.
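Roughly, that looks like the following sketch, reusing the parse function and urls list from the synchronous version above:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

# The __main__ guard matters here: ProcessPoolExecutor starts new
# processes, which re-import this module.
if __name__ == "__main__":
    results = []

    # A pool of four workers, each churning through submitted tasks.
    with ProcessPoolExecutor(max_workers=4) as executor:
        # submit() returns a Future right away; the actual work happens later.
        future_results = [executor.submit(parse, url) for url in urls]

        # as_completed() yields each future as soon as its task has finished,
        # at which point result() hands us parse()'s return value.
        for future in as_completed(future_results):
            results.append(future.result())
```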
In the above code, we're submitting tasks to the executor – four workers – each of which will execute the parse function against a URL. This execution does not happen immediately. For each submission, the executor returns an instance of a Future, which tells us that our task will be executed at some point in the … well, future. The as_completed function watches our future_results for completion, upon which we'll be able to fetch each result via the result method.
My favorite part about this module is the clarity of its API – tasks are submitted to an executor, which is made up of one or more workers, each of which is churning through our tasks. Because our tasks are executed asynchronously, we are not waiting for a given task's completion before submitting another – we submit them at will, with completion happening in the future. Once a task has completed, we can get its result.
Closing up
With a few changes to your code and some concurrent.futures love, you no longer have to fetch those basketball stats one page at a time.
But don’t be a jerk either.