In 2015, I created a 4-hour video series called Introduction to machine learning in Python with scikit-learn. In the years since, hundreds of thousands of students have watched these videos, and thousands continue to do so every month.
At the time of the recording, I was using Python 2.7 and scikit-learn 0.16. Although the video content remains entirely relevant, some of the code is now outdated due to changes in both Python and scikit-learn.
I recently updated the Jupyter notebooks shown in the videos to use Python 3.6 and scikit-learn 0.19.1 in order to take advantage of the newest language features. (You can download the updated notebooks from GitHub.) During this process, I documented my changes (below) so that others can have an easier time updating their own code.
Of course, this is not an exhaustive list of all scikit-learn changes; rather, it only includes the changes that affected my code. The only way to truly keep up with changes to the library is to read the detailed scikit-learn release notes.
I hope this is helpful to you. Please let me know in the comments section below if you have any questions!
Contents
Part 1: scikit-learn changes
Part 2: Python changes
Part 3: Other changes
Model evaluation classes and functions have been moved
What changed: In scikit-learn 0.18, the classes and functions from the cross_validation, grid_search, and learning_curve modules were moved into a new model_selection module.
How to update your code: You need to update the import statements.
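For example, using train_test_split and GridSearchCV as representative names (your own imports may involve different classes or functions from these modules):

Before:

```python
# old locations (scikit-learn 0.17 and earlier)
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
```

After:

```python
# new location (scikit-learn 0.18 and later)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
```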
Further reading: Model Selection Enhancements and API Changes
Grid search and randomized search have changed how they report results
What changed: Starting in scikit-learn 0.18, the results of a grid search or randomized search are accessed via the cv_results_ attribute, replacing the grid_scores_ attribute.
Explanation: The grid_scores_ attribute was a list of named tuples, in which each tuple represented the results of testing a single set of parameters. The cv_results_ attribute, on the other hand, is a dictionary of 1D arrays, in which each array represents a single metric (such as mean_test_score) across all sets of parameters. The structure was changed so that the results can easily be converted into a pandas DataFrame, which is especially useful since cv_results_ provides significantly more information about the search results than grid_scores_ did.
How to update your code: You should convert cv_results_ to a DataFrame (as shown below) before exploring the results.
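For example, assuming grid is a fitted GridSearchCV object (an illustrative name, not necessarily the one used in the original code):

Before:

```python
# a list of named tuples, one per parameter combination
grid.grid_scores_
```

After:

```python
import pandas as pd

# convert the dictionary of 1D arrays into a DataFrame
results = pd.DataFrame(grid.cv_results_)
results[['mean_test_score', 'std_test_score', 'params']]
```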
Note: The best_estimator_, best_score_, and best_params_ attributes are still available and did not change.
Further reading: Model Selection Enhancements and API Changes
Related notebook: Efficiently searching for optimal tuning parameters
Grid search and randomized search can return training scores
What changed: Starting in scikit-learn 0.18, grid search and randomized search can optionally calculate the training scores for each cross-validation split by setting return_train_score=True. Starting in scikit-learn 0.19.1, the default value of return_train_score was changed from True to 'warn' to alert users that calculating training scores may slow down the search significantly.
Explanation: Calculating the training scores is not required in order to select the best set of parameters; it is only useful for gaining insight into how different parameter settings affect the overfitting/underfitting trade-off.
How to update your code: You should explicitly set return_train_score=False unless you specifically need to calculate the training scores.
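For example, assuming knn is an estimator and param_grid is a parameter grid defined earlier (illustrative names):

Before:

```python
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
```

After:

```python
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy',
                    return_train_score=False)
```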
Further reading: scikit-learn 0.19.1 release notes
Related notebook: Efficiently searching for optimal tuning parameters
Scoring parameters for loss functions have been renamed
What changed: Starting in scikit-learn 0.18, the names of scoring parameters for which “lower is better” are now prefixed by 'neg_', such as 'neg_mean_squared_error'.
Explanation: Some model evaluation metrics (known as “reward functions”) have the property that higher values are better than lower values, such as accuracy, precision, and recall. Other metrics (known as “loss functions”) have the property that lower values are better, such as log loss, mean absolute error, and mean squared error. Because optimization tools such as GridSearchCV are built to maximize the evaluation metric (meaning they always treat higher values as better than lower values), scikit-learn automatically negates the scores any time a loss function is selected as the scoring parameter. The negation of scores still takes place in scikit-learn 0.18 (and beyond), but the affected scoring parameters have been renamed in order to reduce confusion.
How to update your code: Any time you are using a loss function as a scoring parameter, you need to add the 'neg_' prefix to the parameter name. Currently, this includes: 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', and 'neg_median_absolute_error'.
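For example, assuming linreg is a regression model and X and y are the features and response defined earlier (illustrative names):

Before:

```python
scores = cross_val_score(linreg, X, y, cv=10, scoring='mean_squared_error')
```

After:

```python
scores = cross_val_score(linreg, X, y, cv=10, scoring='neg_mean_squared_error')
```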
Note: This change only affects classes and functions with a scoring parameter, such as cross_val_score and GridSearchCV. The functions in the metrics module, such as metrics.mean_squared_error, have not been renamed because they continue to output positive scores.
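Here is a minimal sketch of that distinction, using a tiny made-up dataset (not from the original notebooks):

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# tiny illustrative dataset: 5 samples, 1 feature
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# the scoring parameter uses the renamed metric and outputs negated scores
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_squared_error')
print(scores.mean())  # negative

# the metrics function keeps its original name and outputs positive scores
model = LinearRegression().fit(X, y)
print(metrics.mean_squared_error(y, model.predict(X)))  # positive
```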
Further reading: The scoring parameter: defining model evaluation rules
Related notebook: Cross-validation for parameter tuning, model selection, and feature selection
Only 2D data arrays can be passed to models
What changed: Starting in scikit-learn 0.17, only 2D data arrays can be passed to models as input. 1D data arrays are no longer accepted.
Explanation: When you pass input data to a model (to fit or predict, for example), the data must now be explicitly shaped (n_samples, n_features). In other words, each row of the array should represent a sample, and each column should represent a feature. Prior to scikit-learn 0.17, you could pass a 1D data array to a model, and it would infer how that array should be interpreted. That is no longer allowed because it can cause confusion about whether the array elements should be interpreted as samples or features.
How to update your code: If you try to pass a list such as [3, 5, 4, 2] to a model, it will be interpreted as a 1D array of shape (4,) and won’t be accepted. If you meant for it to be interpreted as 1 sample with 4 features, then its shape needs to be changed to (1, 4). (Three options are shown below for how to accomplish this.) If you meant for it to be interpreted as 4 samples with 1 feature, then its shape needs to be changed to (4, 1).
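For example, assuming knn is a fitted model that expects 4 features (an illustrative name):

Before:

```python
# a 1D array of shape (4,) is no longer accepted
knn.predict([3, 5, 4, 2])
```

After:

```python
import numpy as np

# option 1: wrap the list in another list, creating 1 row with 4 columns
knn.predict([[3, 5, 4, 2]])

# option 2: build the 2D NumPy array directly
knn.predict(np.array([[3, 5, 4, 2]]))

# option 3: reshape a 1D array to 1 row, letting NumPy infer the columns
knn.predict(np.array([3, 5, 4, 2]).reshape(1, -1))
```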
Related notebook: Training a machine learning model with scikit-learn
Print is no longer a statement
What changed: Starting in Python 3, print is a function rather than a statement.
How to update your code: You need to convert your print statements to function calls.
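For example, printing a string:

Before:

```python
print 'hello'
```

After:

```python
print('hello')
```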
Further reading: What’s New In Python 3.0
Many Python 3 functions output iterators instead of lists
What changed: Starting in Python 3, the range and zip functions (among others) return iterators instead of lists.
How to update your code: If you need a list, you can explicitly convert the output of range or zip using the list function.
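For example:

Before:

```python
range(5)             # Python 2 returns the list [0, 1, 2, 3, 4]
zip([1, 2], [3, 4])  # Python 2 returns the list [(1, 3), (2, 4)]
```

After:

```python
list(range(5))             # [0, 1, 2, 3, 4]
list(zip([1, 2], [3, 4]))  # [(1, 3), (2, 4)]
```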
Further reading: Python 3’s range is more powerful than Python 2’s xrange
IPython Notebook is now called Jupyter Notebook
What changed: Starting in late 2015, the official name of the “IPython Notebook” was changed to “Jupyter Notebook”.
Explanation: Originally, IPython was an interactive Python shell, and the IPython Notebook was a browser-based interactive environment that used IPython as its “kernel” (execution environment). Over time, the IPython Notebook gained support for other kernels (such as Julia and R) and thus became language agnostic. The name was changed from “IPython Notebook” to “Jupyter Notebook” to avoid implying that it only supported the Python programming language, though IPython is still the default kernel for the Notebook.
How to update your code: Assuming you have the Jupyter Notebook installed, you should type jupyter notebook at the command line (instead of ipython notebook) to open the Notebook dashboard.
Further reading: The Big Split
Related notebook: Setting up Python for machine learning: scikit-learn and Jupyter Notebook
External datasets have been moved to the GitHub repository
What changed: The code from the video series relied on two external datasets, which have now been moved to the GitHub repository.
Explanation: In the video series, I used two external datasets as examples, and read the files into pandas via URL. One of those files has since been taken offline, and the other file has since been modified, which broke my code. To protect against these problems occurring again, I located the original files, moved them to the GitHub repository, and now refer to them in the code using relative paths.
How to update your code: When reading in the files, refer to them using relative paths (as shown below). Note that this will only work if the data files are on your local machine in a data subdirectory, which can be achieved by cloning or downloading the GitHub repository.
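For example, assuming a file named dataset.csv (a placeholder name; the original URLs and file names are not reproduced here):

Before:

```python
import pandas as pd

# read the file directly from a URL (placeholder address)
data = pd.read_csv('http://example.com/dataset.csv')
```

After:

```python
import pandas as pd

# read the file from the repository's data subdirectory via a relative path
data = pd.read_csv('data/dataset.csv')
```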