Get a 2–6x Speed-up on Your Data Pre-processing with Python

By George Seif, AI / Machine Learning Engineer

Python is the go-to programming language for all things machine learning. It’s easy to use and has many fantastic libraries that make crunching data a breeze! But things get a bit trickier when we’re dealing with lots of data…

These days, the term “big data” usually refers to a dataset that’s on the order of at least hundreds of thousands if not millions of data points! At such a scale, every little computation adds up and we need to keep efficiency in mind when coding each step of the pipeline. One critical step that is often overlooked when thinking about our machine learning system’s efficiency is the pre-processing stage, where we must apply some kind of operation to all of our data points.

By default, Python programs execute as a single process using a single CPU core. Most modern machines made for machine learning have at least 2 CPU cores. That means that with 2 CPU cores, 50% of your computer’s processing power won’t be doing anything by default when you run your pre-processing! The situation gets even worse when you get to 4 cores (modern Intel i5) or 6 cores (modern Intel i7).

But thankfully, there is a somewhat hidden feature in a built-in Python library that lets us take advantage of all of our CPU cores! Thanks to Python’s concurrent.futures module, it only takes 3 lines of code to turn a normal program into one that can process data in parallel across the CPU cores.

The standard approach

Let’s take a simple example where we have a dataset of images in a single folder; maybe we even have thousands or millions of those images! For the sake of processing time we’ll use 1000 here. We would like to resize all of the images to 600x600 before passing them to our deep neural network. This is what it would look like in some pretty standard Python code you would often see on GitHub.
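As a rough sketch of what such a script might look like, the following uses only the standard library so that it runs anywhere. The names load_and_resize and process_sequentially are assumptions made for illustration, and the worker only reads each file's bytes where a real script would decode the image and resize it with an image library.

```python
import glob


def load_and_resize(image_filename):
    """Hypothetical stand-in for the real per-image work.

    A real script would decode the image and resize it to 600x600 with an
    image library (for example Pillow: Image.open(path).resize((600, 600))).
    Here we only read the raw bytes so the sketch needs nothing beyond the
    standard library.
    """
    with open(image_filename, "rb") as f:
        data = f.read()
    return image_filename, len(data)


def process_sequentially(image_files):
    """Process every file one at a time in a plain for loop."""
    results = []
    for image_filename in image_files:
        results.append(load_and_resize(image_filename))
    return results


if __name__ == "__main__":
    # Build the list of files to process (all jpgs in the current folder).
    image_files = sorted(glob.glob("*.jpg"))
    # Process them one by one on a single CPU core.
    results = process_sequentially(image_files)
    print(f"Processed {len(results)} images")
```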

This program follows a simple pattern you’ll often see in data processing scripts:

1. You start with a list of files (or other data) that you want to process.
2. You process each piece of data, one at a time, using a for loop, running the pre-processing on each loop iteration.

Let’s test this program on a folder with 1000 jpeg files and see how long it takes to run.

On my i7-8700K with 6 CPU cores, that gave me a run time of 7.9864 seconds! It seems a bit slow for such a high-end CPU. Let’s see what we can do to speed things up.
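One way to take such a measurement from inside Python is with time.perf_counter(). The helper name timed is hypothetical and the workload in the demo is only a placeholder; the numbers you see will depend on your machine.

```python
import time


def timed(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Placeholder workload standing in for the image-processing loop.
    result, elapsed = timed(sum, range(1_000_000))
    print(f"Finished in {elapsed:.4f} seconds")
```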

The fast way

To understand how we would want Python to process things in parallel, it helps to think intuitively about parallel processing itself. Let’s say we have to perform the same single task of hitting nails into a piece of wood and that we have 1000 nails in our bucket. If we say that each nail takes 1 second, then with 1 person we would finish the job in 1000 seconds. But if we have 4 people on the team, we would divide the bucket into 4 equal piles and then each person on the team would work on their own pile of nails. With this method, we would finish in only 250 seconds!
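The arithmetic behind this idealised picture can be written down in a few lines. The function name ideal_time is hypothetical, and the sketch ignores any overhead from coordinating the workers.

```python
import math


def ideal_time(num_tasks, seconds_per_task, num_workers):
    """Best-case wall-clock time when tasks are split evenly across workers."""
    # The busiest worker receives ceil(num_tasks / num_workers) tasks.
    tasks_per_worker = math.ceil(num_tasks / num_workers)
    return tasks_per_worker * seconds_per_task


if __name__ == "__main__":
    print(ideal_time(1000, 1, 1))  # one person: 1000 seconds
    print(ideal_time(1000, 1, 4))  # four people: 250 seconds
```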

We can have Python do something similar for us in our example here with the 1000 images:

1. Split the list of jpg files into 4 smaller groups.
2. Run 4 separate instances of the Python interpreter.
3. Have each instance of Python process one of the 4 smaller groups of data.
4. Combine the results from the 4 processes to get the final list of results.
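To make those four steps concrete, here is a hand-rolled sketch of roughly what a process pool does on our behalf. All helper names (split_into_groups, process_group, run_in_groups) are hypothetical, load_and_resize is a placeholder that does no real image work, and the executor_cls parameter is an assumption added so that a thread pool can be swapped in for quick experiments.

```python
import concurrent.futures
import glob


def load_and_resize(image_filename):
    """Hypothetical stand-in for the real per-image work."""
    return f"resized:{image_filename}"


def split_into_groups(items, num_groups):
    """Split a list into num_groups contiguous, near-equal groups."""
    size, extra = divmod(len(items), num_groups)
    groups, start = [], 0
    for i in range(num_groups):
        end = start + size + (1 if i < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups


def process_group(group):
    """One worker processes its own group of items."""
    return [load_and_resize(item) for item in group]


def run_in_groups(items, num_groups=4,
                  executor_cls=concurrent.futures.ProcessPoolExecutor):
    # Step 1: split the work, dropping empty groups when there are few items.
    groups = [g for g in split_into_groups(items, num_groups) if g]
    results = []
    # Step 2: start the pool of worker processes.
    with executor_cls(max_workers=num_groups) as executor:
        # Step 3: hand each worker its own group.
        futures = [executor.submit(process_group, g) for g in groups]
        # Step 4: combine the per-group results, in group order.
        for future in futures:
            results.extend(future.result())
    return results


if __name__ == "__main__":
    image_files = sorted(glob.glob("*.jpg"))
    results = run_in_groups(image_files)
    print(f"Processed {len(results)} images")
```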

The great part about all this is that Python handles all the hard work for us. We just tell it which function we want to run and how many instances of Python to use and it does all the rest! We only have to change 3 lines of code.
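A minimal sketch of what the parallel version might look like, using only the standard library. As before, load_and_resize is a hypothetical stand-in that reads each file's bytes rather than truly resizing an image, and executor_cls is an assumed parameter that lets a thread pool be swapped in. Compared with a plain loop, the changes are the concurrent.futures import, the executor context manager, and the executor.map() call.

```python
import concurrent.futures
import glob


def load_and_resize(image_filename):
    """Hypothetical stand-in for decoding an image and resizing it to 600x600."""
    with open(image_filename, "rb") as f:
        data = f.read()
    return image_filename, len(data)


def process_in_parallel(image_files,
                        executor_cls=concurrent.futures.ProcessPoolExecutor):
    """Run load_and_resize on every file using a pool of workers."""
    # With no arguments, the executor chooses the number of workers itself.
    with executor_cls() as executor:
        # map() sends one filename to each call of load_and_resize.
        return list(executor.map(load_and_resize, image_files))


if __name__ == "__main__":
    image_files = sorted(glob.glob("*.jpg"))
    results = process_in_parallel(image_files)
    print(f"Processed {len(results)} images")
```

The `if __name__ == "__main__":` guard matters here: on platforms where worker processes start by importing the script again (Windows, and macOS by default), code outside the guard would be re-run in every worker.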

Creating a ProcessPoolExecutor from concurrent.futures with no arguments boots up as many Python processes as you have CPU cores, in my case 6. The actual processing happens in the call to executor.map().

The executor.map() call takes as input the function you would like to run and a list in which each element is a single input to that function. Since we have 6 cores, we will be processing 6 items from that list at the same time!
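Because each list element becomes the single input of one function call, a worker that needs extra arguments (such as the target size) has to have them bound ahead of time. One way to do that is functools.partial, sketched below. The names describe_resize and map_with_fixed_size are hypothetical, the worker only reports what it would do, and executor_cls is an assumed parameter for swapping in a thread pool.

```python
import concurrent.futures
import glob
from functools import partial


def describe_resize(image_filename, size):
    """Hypothetical worker: reports what it would do instead of resizing."""
    width, height = size
    return f"{image_filename} -> {width}x{height}"


def map_with_fixed_size(image_files, size=(600, 600),
                        executor_cls=concurrent.futures.ProcessPoolExecutor):
    # executor.map() passes exactly one list element per call, so the extra
    # argument (the target size) is bound in advance with partial().
    worker = partial(describe_resize, size=size)
    with executor_cls() as executor:
        return list(executor.map(worker, image_files))


if __name__ == "__main__":
    for line in map_with_fixed_size(sorted(glob.glob("*.jpg"))):
        print(line)
```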

If we time our program again, we get a run time of 1.14265 seconds, close to a 7x speed-up over the 7.9864 seconds of the sequential version!

Note: There is some overhead in spawning more Python processes and shuffling data around between them, so you won’t always get this much of a speed improvement. But overall your speed-up will usually be quite significant.
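One knob that may reduce this overhead when each item is very cheap to process is the chunksize argument of executor.map(), which sends items to the worker processes in batches rather than one at a time. The sketch below uses a hypothetical tiny_task function and an arbitrary chunk size of 100; executor_cls is an assumed parameter for swapping in a thread pool, where chunksize has no effect.

```python
import concurrent.futures


def tiny_task(x):
    """Hypothetical cheap task, fast enough that messaging cost dominates."""
    return x * x


def map_in_chunks(items, chunksize=100,
                  executor_cls=concurrent.futures.ProcessPoolExecutor):
    with executor_cls() as executor:
        # With a process pool, each worker receives `chunksize` items at once.
        return list(executor.map(tiny_task, items, chunksize=chunksize))


if __name__ == "__main__":
    results = map_in_chunks(range(10_000))
    print(results[:5])
```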

Is it always super fast?

Using Python parallel pools is a great solution when you have a list of data to process and you are performing a similar computation on each data point. But it’s not always going to be the perfect solution. The items handed to a parallel pool won’t be processed in any predictable order, even though executor.map() returns its results in the same order as the inputs. If each step of your processing depends on the result of an earlier one, then this method probably isn’t right for you.
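A small sketch of the ordering behaviour: executor.map() hands results back in input order, whereas concurrent.futures.as_completed() yields them as the work finishes. It uses a thread pool and artificial sleep times purely to keep the demonstration short, and slow_identity is a hypothetical task; the ordering of map() results is the same for a process pool.

```python
import concurrent.futures
import time


def slow_identity(delay):
    """Hypothetical task: sleep for `delay` seconds, then return it."""
    time.sleep(delay)
    return delay


def mapped_order(delays):
    # map() yields results in the order of the inputs.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(delays)) as executor:
        return list(executor.map(slow_identity, delays))


def completion_order(delays):
    # as_completed() yields each future as soon as it has finished.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(delays)) as executor:
        futures = [executor.submit(slow_identity, d) for d in delays]
        return [f.result() for f in concurrent.futures.as_completed(futures)]


if __name__ == "__main__":
    delays = [0.3, 0.1, 0.2]
    print(mapped_order(delays))      # same order as the inputs
    print(completion_order(delays))  # typically the shortest delay first
```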

The data you are processing also needs to be a type that Python knows how to “pickle”. Luckily, these are quite common. From the official Python documentation:

- None, True, and False
- integers, floating point numbers, complex numbers
- strings, bytes, bytearrays
- tuples, lists, sets, and dictionaries containing only picklable objects
- functions defined at the top level of a module (using def, not lambda)
- built-in functions defined at the top level of a module
- classes that are defined at the top level of a module
- instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section Pickling Class Instances for details).
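If you are unsure whether something can be sent to a worker process, a quick check is to try pickling it yourself. The helper name is_picklable is hypothetical; it simply wraps pickle.dumps() and reports whether it succeeded.

```python
import pickle


def is_picklable(obj):
    """Return True if obj survives pickle.dumps(), False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


if __name__ == "__main__":
    # Plain containers of picklable objects are fine.
    print(is_picklable({"sizes": [(600, 600)], "ok": True}))
    # Built-in functions are pickled by name.
    print(is_picklable(len))
    # A lambda has no importable name, so pickling it fails.
    print(is_picklable(lambda x: x))
```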

Bio: George Seif is a Certified Nerd and AI / Machine Learning Engineer.

Original. Reposted with permission.
