A tutorial on tidy cross-validation with R

Let’s load the needed packages:

library("tidyverse")
library("tidymodels")
library("parsnip")
library("brotools")
library("mlbench")

Load the data, included in the {mlbench} package:

data("BostonHousing2")

I will train a random forest to predict the housing price, which is the cmedv column:

head(BostonHousing2)
## town tract lon lat medv cmedv crim zn indus chas nox
## 1 Nahant 2011 -70.9550 42.2550 24.0 24.0 0.00632 18 2.31 0 0.538
## 2 Swampscott 2021 -70.9500 42.2875 21.6 21.6 0.02731 0 7.07 0 0.469
## 3 Swampscott 2022 -70.9360 42.2830 34.7 34.7 0.02729 0 7.07 0 0.469
## 4 Marblehead 2031 -70.9280 42.2930 33.4 33.4 0.03237 0 2.18 0 0.458
## 5 Marblehead 2032 -70.9220 42.2980 36.2 36.2 0.06905 0 2.18 0 0.458
## 6 Marblehead 2033 -70.9165 42.3040 28.7 28.7 0.02985 0 2.18 0 0.458
## rm age dis rad tax ptratio b lstat
## 1 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 6.430 58.7 6.0622 3 222 18.7 394.12 5.21

Only keep relevant columns:

boston <- BostonHousing2 %>% 
 select(-medv, -town, -lon, -lat) %>% 
 rename(price = cmedv)

I remove town, lat and lon because the information contained in the column tract is enough.

To train and evaluate the model’s performance, I split the data in two. One data set, which I call the training set, will be further split into two down below. I won’t touch the second data set, the test set, until the very end.

train_test_split <- initial_split(boston, prop = 0.9)

housing_train <- training(train_test_split)

housing_test <- testing(train_test_split)

I want to train a random forest to predict the price of houses, but random forests have so-called hyperparameters, which are parameters that cannot be estimated, or learned, from the data. Instead, these parameters have to be chosen by the analyst. To choose them, you can use values from the literature that seem to have worked well (as is often done in macro-econometrics), or you can further split the training set in two, create a grid of hyperparameters, train the model on one part of the data for every combination in the grid, and compare the predictions of the models on the second part of the data. You then stick with the model that performed best, for example the model with the lowest RMSE.

The thing is, you can’t estimate the true value of the RMSE with only one value. It’s as if you wanted to estimate the height of the population by drawing one single observation from it: you need more observations than that. To approach the true value of the RMSE for a given set of hyperparameters, instead of doing one split, I’ll do 30, and then compute the average RMSE. This implies training 30 models for each combination of hyperparameter values I am interested in.
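To make the grid idea concrete, here is a minimal sketch (not part of the original analysis) of what such a hyperparameter grid could look like, using tidyr::crossing(); the mtry and trees values are arbitrary and only serve as an illustration:

# Hypothetical grid of hyperparameter values, purely for illustration:
# each row is one combination that would be trained and evaluated.
hyper_grid <- tidyr::crossing(mtry = c(2, 3, 5), trees = c(100, 200, 500))

hyper_grid

Each row of hyper_grid would then be one candidate combination to train and evaluate on the resamples described below.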

First, let’s split the training data again, using the mc_cv() function from the {rsample} package. This function implements Monte Carlo cross-validation:

validation_data <- mc_cv(housing_train, prop = 0.9, times = 30)

What does validation_data look like?

validation_data
## # Monte Carlo cross-validation (0.9/0.1) with 30 resamples
## # A tibble: 30 x 2
## splits id 
## 
## 1 Resample01
## 2 Resample02
## 3 Resample03
## 4 Resample04
## 5 Resample05
## 6 Resample06
## 7 Resample07
## 8 Resample08
## 9 Resample09
## 10 Resample10
## # ... with 20 more rows

Let’s look further down:

validation_data$splits[[1]]
## <411/45/456>

The first value is the number of rows of the first set, the second value the number of rows of the second set, and the third the original number of observations in the training data, before splitting again.

How should we call these two new data sets? The author of {rsample}, Max Kuhn, refers to them as the analysis and the assessment sets.
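To make these names concrete, here is a quick sketch (just a check, not a step of the analysis) using the analysis() and assessment() accessors from {rsample} on the first split:

# The analysis set is the larger piece used to fit the model,
# the assessment set is the smaller piece used to evaluate it.
dim(analysis(validation_data$splits[[1]]))

dim(assessment(validation_data$splits[[1]]))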

Now, in order to continue, I need to pre-process the data. I will do this in three steps. The first and second steps center and scale the numeric variables, and the third step converts character and factor variables to dummy variables. This is needed because I will train a random forest, which cannot handle factor variables directly. Let’s define a recipe to do that, and start by pre-processing the testing set. I write a wrapper function around the recipe, because I will need to apply this recipe to various data sets:

simple_recipe <- function(dataset){
  recipe(price ~ ., data = dataset) %>%
    step_center(all_numeric()) %>%
    step_scale(all_numeric()) %>%
    step_dummy(all_nominal())
}

Once the recipe is defined, I can use the prep() function, which estimates the parameters from the data that are needed to process the data. For example, for centering, prep() estimates the mean which will then be subtracted from the variables. With bake() the estimates are then applied to the data:

testing_rec <- prep(simple_recipe(housing_test), training = housing_test)

test_data <- bake(testing_rec, newdata = housing_test)
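If you want to check what prep() estimated, the tidy() method from {recipes} should show the stored values; this is a quick sketch, where the number argument refers to the step of the recipe (1 for step_center() here):

# Inspect the means estimated by step_center() in the prepped recipe;
# each numeric variable gets the value that will be subtracted from it.
tidy(testing_rec, number = 1)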

It is important to split the data before using prep() and bake(), because if not, you will use observations from the test set in the prep() step, and thus introduce knowledge from the test set into the training data. This is called data leakage, and it must be avoided. This is why it is necessary to first split the training data into an analysis and an assessment set, and then also pre-process these sets separately. However, the validation_data object cannot be used with recipe() directly, because it is not a data frame. No worries, I simply need to write a function that extracts the analysis and assessment sets from the validation_data object, applies the pre-processing, trains the model, and returns the predictions, from which I then compute the RMSE. This will be a big function, at the center of the analysis.

But before that, let’s run a simple linear regression as a benchmark. For the linear regression, I will not use any CV, so let’s pre-process the training set:

trainlm_rec <- prep(simple_recipe(housing_train), training = housing_train)

trainlm_data <- bake(trainlm_rec, newdata = housing_train)

linreg_model <- lm(price ~ ., data = trainlm_data)

broom::augment(linreg_model, newdata = test_data) %>%
  rmse(price, .fitted)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## 
## 1 rmse standard 0.469

broom::augment() adds the predictions to the test_data in a new column, .fitted. I won’t use this trick with the random forest, because there is no augment() method for random forests from the {ranger} package, which I’ll use. I’ll add the predictions to the data myself.
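As a sketch of what adding the predictions myself looks like, here is a hypothetical fit on the pre-processed training set (rf_fit is only used for this illustration); parsnip’s predict() returns a tibble with a .pred column, which I can bind to the data before computing the RMSE:

# Hypothetical example: fit a random forest on the pre-processed training set,
# then add the predictions to test_data by hand and compute the RMSE.
rf_fit <- rand_forest(mode = "regression", mtry = 3, trees = 200) %>%
  set_engine("ranger") %>%
  fit(price ~ ., data = trainlm_data)

test_data %>%
  mutate(prediction = predict(rf_fit, new_data = test_data)$.pred) %>%
  rmse(price, prediction)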

Ok, now let’s go back to the random forest and write the big function:

my_rf <- function(mtry, trees, split, id){
  # Pre-process the analysis set and train the random forest on it
  analysis_set <- analysis(split)
  analysis_prep <- prep(simple_recipe(analysis_set), training = analysis_set)
  analysis_processed <- bake(analysis_prep, newdata = analysis_set)

  model <- rand_forest(mtry = mtry, trees = trees) %>%
    set_engine("ranger", importance = 'impurity') %>%
    fit(price ~ ., data = analysis_processed)

  # Pre-process the assessment set and return truth alongside the predictions
  assessment_set <- assessment(split)
  assessment_prep <- prep(simple_recipe(assessment_set), training = assessment_set)
  assessment_processed <- bake(assessment_prep, newdata = assessment_set)

  tibble::tibble("id" = id,
                 "truth" = assessment_processed$price,
                 "prediction" = unlist(predict(model, new_data = assessment_processed)))
}

The rand_forest() function is available from the {parsnip} package. This package provides a unified interface to a lot of other machine learning packages. This means that instead of having to learn the syntax of ranger() and randomForest() and so on, you can simply use the rand_forest() function and change the engine argument to the one you want (ranger, randomForest, etc.).
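For instance, here is a small sketch (not from the original workflow) of the same specification with two different engines; only the set_engine() call changes:

# Same model specification, two different backends: nothing else changes.
rf_spec_ranger <- rand_forest(mode = "regression", mtry = 3, trees = 200) %>%
  set_engine("ranger")

rf_spec_randomForest <- rand_forest(mode = "regression", mtry = 3, trees = 200) %>%
  set_engine("randomForest")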

Let’s try this function:

results_example <- map2_df(.x = validation_data$splits,
 .y = validation_data$id,
 ~my_rf(mtry = 3, trees = 200, split = .x, id = .y))
head(results_example)
## # A tibble: 6 x 3
## id truth prediction
## 
## 1 Resample01 0.0235 -0.104 
## 2 Resample01 -0.135 -0.0906
## 3 Resample01 -0.378 -0.158 
## 4 Resample01 -0.232 0.0623
## 5 Resample01 -0.0859 0.0173
## 6 Resample01 0.169 0.303

I can now compute the RMSE when mtry = 3 and trees = 200:

results_example %>%
 group_by(id) %>%
 rmse(truth, prediction) %>%
 summarise(mean_rmse = mean(.estimate)) %>%
 pull
## [1] 0.4319164
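Since the whole point of doing 30 splits was to get a better estimate than a single value, it is also worth looking at the spread of the per-resample RMSEs; a small sketch using the same objects as above:

# Besides the mean, the standard deviation of the 30 per-resample RMSEs
# indicates how variable the estimate of the RMSE is.
results_example %>%
  group_by(id) %>%
  rmse(truth, prediction) %>%
  summarise(mean_rmse = mean(.estimate),
            sd_rmse = sd(.estimate))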

The random forest already has a lower RMSE than the linear regression. The goal now is to lower this RMSE by tuning the mtry and trees hyperparameters. For this, I will use Bayesian Optimization methods implemented in the {mlrMBO} package.