This is naive advice for real beginners; still, I’m sure I will copy-paste snippets from here over and over again.
Let’s imagine we need to take some shared data (e.g. a dataframe) and run a lot of similar computations on it. Looks like a nice candidate for parallel computation, right?
Of course, the most straightforward way to get the result is a for loop (or a list comprehension, which is pretty similar). Is it fast? Not sure…
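A minimal sketch of the setup, with a made-up dataframe and a toy `compute` function standing in for the real work:

```python
import pandas as pd

# Toy stand-ins: a shared dataframe and one of many similar
# CPU-bound computations over it (both made up for illustration).
df = pd.DataFrame({"value": range(1_000_000)})

def compute(i):
    return (df["value"] * i).sum()

# The straightforward sequential version
results = [compute(i) for i in range(20)]
```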
The easiest way to parallelize is multiprocessing.dummy.Pool for thread-based parallelism and multiprocessing.Pool for process-based parallelism.
Let’s start with threads:
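Something along these lines, reusing `compute` from the sketch above (the pool size of 4 is an arbitrary pick):

```python
from multiprocessing.dummy import Pool as ThreadPool

# 4 worker threads; the API mirrors multiprocessing.Pool
with ThreadPool(4) as pool:
    results = pool.map(compute, range(20))
```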
No speedup in this synthetic example: extra time is wasted on thread switching, and efficiency sucks because the GIL lets only one thread crunch the dataframe at a time. OK, trying processes:
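For example (a lambda here just to keep it to one expression):

```python
from multiprocessing import Pool

with Pool(4) as pool:
    # the target function is pickled to reach the worker processes,
    # and a lambda can't be pickled, so this raises PicklingError
    results = pool.map(lambda i: (df["value"] * i).sum(), range(20))
```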
Fail: the target function travels to the workers via pickle, and a lambda can’t be pickled. BTW, that’s not the only pickle-related limitation (e.g. you can’t pickle local objects either).
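A quick illustration of that last point (the names are arbitrary):

```python
import pickle

def outer():
    def local_f(x):  # defined inside another function
        return x + 1
    return local_f

# AttributeError: Can't pickle local object 'outer.<locals>.local_f'
pickle.dumps(outer())
```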
Should we get sad at this point? Nope, we have the brilliant joblib (btw, it’s the library that powers most of the parallel computation in your favorite scikit-learn).
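Roughly like this, with the same toy `compute`; n_jobs=-1 means “use all cores”. As far as I know, joblib’s default loky backend serializes callables with cloudpickle, which is why functions that break plain multiprocessing usually work here:

```python
from joblib import Parallel, delayed

results = Parallel(n_jobs=-1)(delayed(compute)(i) for i in range(20))
```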
Still two lines, and the last one may look a bit uncommon. But who cares, when you can load all the CPUs on your dev server and get the results faster?