In 2015, I created a 4-hour video series called Introduction to machine learning in Python with scikit-learn. In the years since, hundreds of thousands of students have watched these videos, and thousands continue to do so every month.
At the time of the recording, I was using Python 2.7 and scikit-learn 0.16. Although the video content remains entirely relevant, some of the code is now outdated due to changes in both Python and scikit-learn.
I recently updated the Jupyter notebooks shown in the videos to use Python 3.6 and scikit-learn 0.19.1 in order to take advantage of the newest language features. (You can download the updated notebooks from GitHub.) During this process, I documented my changes (below) so that others can have an easier time updating their own code.
Of course, this is not an exhaustive list of all scikit-learn changes; rather, it only includes the changes that affected my code. The only way to truly keep up with changes to the library is to read the detailed scikit-learn release notes.
I hope this is helpful to you. Please let me know in the comments section below if you have any questions!
Contents
Part 1: scikit-learn changes
Part 2: Python changes
Part 3: Other changes
Model evaluation classes and functions have been moved
What changed: In scikit-learn 0.18, the classes and functions from the cross_validation, grid_search, and learning_curve modules were moved into a new model_selection module.
How to update your code: You need to update the import statements.
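For example, using train_test_split and GridSearchCV as representative names (your own imports may involve different classes or functions from these modules):

Before:

```python
# old locations (scikit-learn 0.17 and earlier)
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
```

After:

```python
# new location (scikit-learn 0.18 and later)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
```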
Further reading: Model Selection Enhancements and API Changes
Grid search and randomized search have changed how they report results
What changed: Starting in scikit-learn 0.18, the results of a grid search or randomized search are accessed via the cv_results_ attribute, replacing the grid_scores_ attribute.
Explanation: The grid_scores_ attribute was a list of named tuples, in which each tuple represented the results of testing a single set of parameters. The cv_results_ attribute, on the other hand, is a dictionary of 1D arrays, in which each array represents a single metric (such as mean_test_score) across all sets of parameters. The structure was changed so that the results can easily be converted into a pandas DataFrame, which is especially useful since cv_results_ provides significantly more information about the search results than grid_scores_ did.
How to update your code: You should convert cv_results_ to a DataFrame (as shown below) before exploring the results.
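For example, assuming grid is a fitted GridSearchCV object (an illustrative name, not necessarily the one used in the original code):

Before:

```python
# a list of named tuples, one per parameter combination
grid.grid_scores_
```

After:

```python
import pandas as pd

# convert the dictionary of 1D arrays into a DataFrame
results = pd.DataFrame(grid.cv_results_)
results[['mean_test_score', 'std_test_score', 'params']]
```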
Note: The best_estimator_, best_score_, and best_params_ attributes are still available and did not change.
Further reading: Model Selection Enhancements and API Changes
Related notebook: Efficiently searching for optimal tuning parameters
Grid search and randomized search can return training scores
What changed: Starting in scikit-learn 0.18, grid search and randomized search can optionally calculate the training scores for each cross-validation split by setting return_train_score=True. Starting in scikit-learn 0.19.1, the default value of return_train_score was changed from True to 'warn' to alert users that calculating training scores may slow down the search significantly.
Explanation: Calculating the training scores is not required in order to select the best set of parameters; it is only useful for gaining insight into how different parameter settings affect the overfitting/underfitting trade-off.
How to update your code: You should explicitly set return_train_score=False unless you specifically need to calculate the training scores.
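For example, assuming knn is an estimator and param_grid is a parameter grid defined earlier (illustrative names):

Before:

```python
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
```

After:

```python
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy',
                    return_train_score=False)
```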
Further reading: scikit-learn 0.19.1 release notes
Related notebook: Efficiently searching for optimal tuning parameters
Scoring parameters for loss functions have been renamed
What changed: Starting in scikit-learn 0.18, the names of scoring parameters for which “lower is better” are now prefixed by 'neg_', such as 'neg_mean_squared_error'.
Explanation: Some model evaluation metrics (known as “reward functions”) have the property that higher values are better than lower values, such as accuracy, precision, and recall. Other metrics (known as “loss functions”) have the property that lower values are better, such as log loss, mean absolute error, and mean squared error. Because optimization tools such as GridSearchCV are built to maximize the evaluation metric (meaning they always treat higher values as better than lower values), scikit-learn automatically negates the scores any time a loss function is selected as the scoring parameter. The negation of scores still takes place in scikit-learn 0.18 (and beyond), but the affected scoring parameters have been renamed in order to reduce confusion.
How to update your code: Any time you are using a loss function as a scoring parameter, you need to add the 'neg_' prefix to the parameter name. Currently, this includes: 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', and 'neg_median_absolute_error'.
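For example, assuming linreg is a regression model and X and y are the features and response defined earlier (illustrative names):

Before:

```python
scores = cross_val_score(linreg, X, y, cv=10, scoring='mean_squared_error')
```

After:

```python
scores = cross_val_score(linreg, X, y, cv=10, scoring='neg_mean_squared_error')
```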
Note: This change only affects classes and functions with a scoring parameter, such as cross_val_score and GridSearchCV. The functions in the metrics module, such as metrics.mean_squared_error, have not been renamed because they continue to output positive scores.
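Here is a minimal sketch of that distinction, using a tiny made-up dataset (not from the original notebooks):

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# tiny illustrative dataset: 5 samples, 1 feature
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# the scoring parameter uses the renamed metric and outputs negated scores
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_squared_error')
print(scores.mean())  # negative

# the metrics function keeps its original name and outputs positive scores
model = LinearRegression().fit(X, y)
print(metrics.mean_squared_error(y, model.predict(X)))  # positive
```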
Further reading: The scoring parameter: defining model evaluation rules
Related notebook: Cross-validation for parameter tuning, model selection, and feature selection
Only 2D data arrays can be passed to models
What changed: Starting in scikit-learn 0.17, only 2D data arrays can be passed to models as input. 1D data arrays are no longer accepted.
Explanation: When you pass input data to a model (to fit or predict, for example), the data must now be explicitly shaped (n_samples, n_features). In other words, each row of the array should represent a sample, and each column should represent a feature. Prior to scikit-learn 0.17, you could pass a 1D data array to a model, and it would infer how that array should be interpreted. That is no longer allowed because it can cause confusion about whether the array elements should be interpreted as samples or features.
How to update your code: If you try to pass a list such as [3, 5, 4, 2] to a model, it will be interpreted as a 1D array of shape (4,) and won’t be accepted. If you meant for it to be interpreted as 1 sample with 4 features, then its shape needs to be changed to (1, 4). (Three options are shown below for how to accomplish this.) If you meant for it to be interpreted as 4 samples with 1 feature, then its shape needs to be changed to (4, 1).
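For example, assuming knn is a fitted model that expects 4 features (an illustrative name):

Before:

```python
# a 1D array of shape (4,) is no longer accepted
knn.predict([3, 5, 4, 2])
```

After:

```python
import numpy as np

# option 1: wrap the list in another list, creating 1 row with 4 columns
knn.predict([[3, 5, 4, 2]])

# option 2: build the 2D NumPy array directly
knn.predict(np.array([[3, 5, 4, 2]]))

# option 3: reshape a 1D array to 1 row, letting NumPy infer the columns
knn.predict(np.array([3, 5, 4, 2]).reshape(1, -1))
```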
Related notebook: Training a machine learning model with scikit-learn
Print is no longer a statement
What changed: Starting in Python 3, print is a function rather than a statement.
How to update your code: You need to convert your print statements to function calls.
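For example, printing a string:

Before:

```python
print 'hello'
```

After:

```python
print('hello')
```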
Further reading: What’s New In Python 3.0
Many Python 3 functions output iterators instead of lists
What changed: Starting in Python 3, the range and zip functions (among others) return iterators instead of lists.
How to update your code: If you need a list, you can explicitly convert the output of range or zip using the list function.
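For example:

Before:

```python
range(5)             # Python 2 returns the list [0, 1, 2, 3, 4]
zip([1, 2], [3, 4])  # Python 2 returns the list [(1, 3), (2, 4)]
```

After:

```python
list(range(5))             # [0, 1, 2, 3, 4]
list(zip([1, 2], [3, 4]))  # [(1, 3), (2, 4)]
```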
Further reading: Python 3’s range is more powerful than Python 2’s xrange
IPython Notebook is now called Jupyter Notebook
What changed: Starting in late 2015, the official name of the “IPython Notebook” was changed to “Jupyter Notebook”.
Explanation: Originally, IPython was an interactive Python shell, and the IPython Notebook was a browser-based interactive environment that used IPython as its “kernel” (execution environment). Over time, the IPython Notebook gained support for other kernels (such as Julia and R) and thus became language agnostic. The name was changed from “IPython Notebook” to “Jupyter Notebook” to avoid implying that it only supported the Python programming language, though IPython is still the default kernel for the Notebook.
How to update your code: Assuming you have the Jupyter Notebook installed, you should type jupyter notebook at the command line (instead of ipython notebook) to open the Notebook dashboard.
Further reading: The Big Split
Related notebook: Setting up Python for machine learning: scikit-learn and Jupyter Notebook
External datasets have been moved to the GitHub repository
What changed: The code from the video series relied on two external datasets, which have now been moved to the GitHub repository.
Explanation: In the video series, I used two external datasets as examples, and read the files into pandas via URL. One of those files has since been taken offline, and the other file has since been modified, which broke my code. To protect against these problems occurring again, I located the original files, moved them to the GitHub repository, and now refer to them in the code using relative paths.
How to update your code: When reading in the files, refer to them using relative paths (as shown below). Note that this will only work if the data files are on your local machine in a data subdirectory, which can be achieved by cloning or downloading the GitHub repository.
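For example, assuming a file named dataset.csv (a placeholder name; the original URLs and file names are not reproduced here):

Before:

```python
import pandas as pd

# read the file directly from a URL (placeholder address)
data = pd.read_csv('http://example.com/dataset.csv')
```

After:

```python
import pandas as pd

# read the file from the repository's data subdirectory via a relative path
data = pd.read_csv('data/dataset.csv')
```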