The Machine Learning Project Checklist

I find it worthwhile to codify and compare different interpretations of a process as a way of strengthening one’s own interpretation of it. I have previously done so with alternate interpretations of what we could call the machine learning process (which is closely aligned, at least to some degree, with the data science and data mining processes), examples of which can be found in several earlier posts.

These previous posts have considered the classic CRISP-DM model, the KDD Process, François Chollet’s 4-step model (aimed at Keras, but generalizable), Yufeng Guo’s 7 steps to machine learning, and even modifications aimed at narrower disciplines, such as the text-based data science task framework. In an effort to further refine our internal models, this post presents an overview of Aurélien Géron’s Machine Learning Project Checklist, as seen in his bestselling book, “Hands-On Machine Learning with Scikit-Learn & TensorFlow.” It’s a similar approach to that of, say, Guo’s 7-step process, but at a subtly higher level. Because it’s presented as a checklist for approaching projects, it feels less prescriptive and more descriptive: a reminder of what you should be doing as you do it, as opposed to some grand explanation of why you are doing it. This is an observation without judgment.

And so below is a brief overview of Géron’s checklist. I humbly suggest that anyone who has not yet read Géron’s book check it out for more useful information aimed at beginners and practitioners alike.

1. Frame the problem

This first step is where the objective is defined. Géron refers to objectives in business terms, but this isn’t strictly necessary. However, an understanding of how the machine learning system’s solution will ultimately be used is important. This step is also where comparable scenarios and current workarounds to a given problem are discussed, assumptions are listed and considered, and the degree of need for human expertise is determined. Other key technical items to frame in this step include determining which type of machine learning problem (supervised, unsupervised, etc.) applies, and adopting appropriate performance metric(s).
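To see why the choice of metric belongs in the framing step, here is a minimal sketch using made-up labels (the 5% positive rate and the always-negative "model" are assumptions for illustration, not an example from Géron's book). It shows how accuracy alone can look excellent on an imbalanced problem while recall exposes that the model is useless.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of actual positives that the model found."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    # Convention chosen here: recall is 0.0 when there are no positives.
    return true_pos / actual_pos if actual_pos else 0.0

# Hypothetical imbalanced problem: 5 positives out of 100 examples.
y_true = [1] * 5 + [0] * 95
# A "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred), recall(y_true, y_pred))  # 0.95 0.0
```

If the objective is to catch the rare positive cases, recall (or a metric built on it) is likely the better thing to track, and that decision is easier to make before any modeling starts.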

2. Get the data

This step is data-centric: determine how much data is needed, what type of data is needed, where to get the data, assess legal obligations surrounding data acquisition… and get the data. Once you have the data, ensure it is appropriately anonymized, make certain you know what type of data it actually is (time series, observations, images, etc.), convert the data to the format you require, and create training, validation, and testing sets as warranted.
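As a rough illustration of the last item, the sketch below shuffles a dataset and carves out training, validation, and testing sets. The function name `train_val_test_split`, the 60/20/20 proportions, and the fixed seed are all assumptions made for this example; libraries such as Scikit-Learn provide their own splitting utilities.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

rows = list(range(100))                        # stand-in for real records
train, val, test = train_val_test_split(rows)
print(len(train), len(val), len(test))         # 60 20 20
```

A random shuffle like this is only reasonable when rows are independent. For time series, a chronological split is usually the safer choice.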

3. Explore the data

This step in the checklist is akin to what is often referred to as Exploratory Data Analysis (EDA). The goal is to try to gain insights from the data prior to modeling. Recall that in the first step assumptions about the data were to be identified and explored; this is a good time to investigate these assumptions more deeply. Human experts can be of particular use in this step, answering questions about correlations which may not be obvious to the machine learning practitioner. Studying features and their characteristics is done here, as is general visualization of features and their values (think of how much easier it is, for example, to quickly identify outliers by box plot than by numerical interrogation). Documenting the findings of your exploration for later use is good practice.
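For comparison with the box plot, here is what the "numerical interrogation" might look like: a small sketch of the common 1.5 × IQR rule that box plot whiskers are typically based on. The `readings` values are invented, and the quartiles come from Python's `statistics.quantiles` with its default method, so the exact cut-offs may differ slightly from those of a plotting library.

```python
from statistics import quantiles

def iqr_outliers(values, whisker=1.5):
    """Return the values lying outside the 1.5 * IQR whiskers."""
    q1, _, q3 = quantiles(values, n=4)         # first, second, third quartile
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical feature values with one suspicious reading.
readings = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
print(iqr_outliers(readings))                  # [95]
```

The code finds the same point a box plot would show as a lone dot, though the plot gets you there at a glance and across many features at once.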

4. Prepare the data

Time to apply the data transformations you identified as worthwhile in the previous step. This step also includes any data cleaning you would perform, as well as both feature selection and engineering. Any feature scaling for value standardization and/or normalization would occur here as well.
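One detail worth sketching for feature scaling: the scaling parameters are usually learned from the training data only and then reused on everything else, so no information leaks from the test set. The helper names `fit_scaler` and `transform` and the sample values below are assumptions for illustration, not part of the checklist.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Learn the mean and standard deviation from the training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    # Guard against constant features, which would otherwise divide by zero.
    if sigma == 0:
        sigma = 1.0
    return mu, sigma

def transform(values, mu, sigma):
    """Standardize values using previously learned parameters."""
    return [(v - mu) / sigma for v in values]

train_values = [2.0, 4.0, 6.0, 8.0]            # hypothetical training feature
test_values = [5.0, 10.0]                      # hypothetical unseen values

mu, sigma = fit_scaler(train_values)
train_scaled = transform(train_values, mu, sigma)
test_scaled = transform(test_values, mu, sigma)  # reuses the training parameters
```

After this, `train_scaled` has zero mean and unit standard deviation, while `test_scaled` is expressed on the same scale without having influenced it.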

5. Model the data

Time to model the data, and whittle the initial set of models down to what appear to be the most promising bunch. (This is similar to the first modeling step in Chollet’s process: good model → “too good” model.) Such attempts may involve using samples of the full dataset to shorten training times for preliminary models, which should cut across a wide spectrum of categories (trees, neural networks, linear, etc.). Models should be built, measured, and compared to one another, and the types of errors each model makes should be investigated, as should the most significant features for each algorithm used. The best-performing models should be shortlisted so that they can be fine-tuned afterwards.
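The build, measure, and compare loop can be sketched in a few lines. The example below is a toy under stated assumptions: the data is synthetic, the two "models" (a mean baseline and a one-feature least-squares line) are stand-ins for a real spread of algorithms, and the helper names are invented for this post. It scores each candidate with k-fold cross-validation and ranks them to form a shortlist.

```python
import random
from statistics import mean

def fit_mean(xs, ys):
    """Baseline model: always predict the average training target."""
    m = mean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """One-feature least-squares line."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def rmse(model, xs, ys):
    """Root mean squared error of a fitted model on the given points."""
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)]) ** 0.5

def cross_val_rmse(fit, xs, ys, k=5):
    """Average RMSE over k folds, each held out once."""
    scores = []
    for fold in range(k):
        held_out = set(range(fold, len(xs), k))
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        test_idx = sorted(held_out)
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(rmse(model, [xs[i] for i in test_idx],
                           [ys[i] for i in test_idx]))
    return mean(scores)

# Synthetic data (an assumption): a noisy linear relationship.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

candidates = {"mean baseline": fit_mean, "linear": fit_linear}
results = {name: cross_val_rmse(fit, xs, ys) for name, fit in candidates.items()}
shortlist = sorted(results, key=results.get)   # best (lowest RMSE) first
print(shortlist)
```

Keeping a trivial baseline in the comparison is a cheap way to check that the more elaborate candidates are actually learning something.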

6. Fine-tune the models

The shortlisted models should now have their hyperparameters fine-tuned, and ensemble methods should be investigated at this stage. If dataset samples were used in the previous modeling phase, full datasets should be used during this step; no fine-tuned model should be selected as the “winner” without having been exposed to all training data or compared with other models which have also been exposed to all training data. Also, you didn’t overfit, right?
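Here is a small sketch of what hyperparameter tuning and the overfitting check can look like. It assumes a toy one-feature k-nearest-neighbors regressor written from scratch, synthetic sine-shaped data, and a hand-picked grid of `k` values; none of this comes from Géron's book. The number of neighbors is chosen by cross-validated error, and the `k = 1` case shows how a perfect training score can hide a worse held-out score.

```python
import math
import random
from statistics import mean

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return mean([y for _, y in nearest])

def mse(train, points, k):
    """Mean squared error on points, predicting from train."""
    return mean([(knn_predict(train, x, k) - y) ** 2 for x, y in points])

def cv_mse(data, k, folds=5):
    """Cross-validated error for one value of the hyperparameter k."""
    scores = []
    for f in range(folds):
        val = data[f::folds]
        train = [p for i, p in enumerate(data) if i % folds != f]
        scores.append(mse(train, val, k))
    return mean(scores)

# Synthetic data (an assumption): a noisy sine curve.
rng = random.Random(1)
xs = [rng.uniform(0, 6) for _ in range(150)]
data = [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

grid = [1, 3, 7, 15, 75]                       # hypothetical search grid
cv_scores = {k: cv_mse(data, k) for k in grid}
best_k = min(cv_scores, key=cv_scores.get)

# With k=1 every training point is its own nearest neighbor, so the
# training error is zero even though the held-out error is not.
train_mse_k1 = mse(data, data, 1)
print(best_k, train_mse_k1, cv_scores[1])
```

A large gap between training error and cross-validated error, as with `k = 1` here, is the usual sign that a model has memorized rather than generalized.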

7. Present the solution

Time to present, so hopefully your visualization skills (or those of someone on the implementation team) are up to par! This is a much less technical step, though ensuring proper documentation of the technical aspects of the system at this point is also important. Answer questions for interested parties: Do interested parties understand the big picture? Does the solution achieve the objective? Have you conveyed assumptions and limitations? This is essentially a sales pitch, so ensure the takeaway is confidence in the system. Why do all this work if the result isn’t understood and adopted?

8. Launch the ML system

Get the machine learning system ready for production; it will need to be plugged into some wider production system or strategy. As a software solution, it should be unit tested before launch and adequately monitored once up and running. Retraining models on fresh or updated data is part of this process and should be taken into account here, even if thought was given to it in an earlier step.
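Monitoring and retraining can take many forms. As one hedged illustration, the sketch below tracks a rolling window of absolute prediction errors and flags when the recent average drifts well above the error measured at launch. The `ErrorMonitor` class, the 1.5× tolerance, and the window size are all hypothetical choices for this example, not a prescribed design.

```python
from collections import deque
from statistics import mean

class ErrorMonitor:
    """Flag a model for retraining when its recent error degrades."""

    def __init__(self, baseline_error, tolerance=1.5, window=50):
        self.baseline_error = baseline_error   # error measured before launch
        self.tolerance = tolerance             # allowed degradation factor
        self.errors = deque(maxlen=window)     # keeps only the latest errors

    def record(self, prediction, actual):
        """Store the absolute error of one production prediction."""
        self.errors.append(abs(prediction - actual))

    def needs_retraining(self):
        """True once a full window averages above the tolerated error."""
        if len(self.errors) < self.errors.maxlen:
            return False                       # not enough evidence yet
        return mean(self.errors) > self.tolerance * self.baseline_error

monitor = ErrorMonitor(baseline_error=1.0, window=5)
for _ in range(5):
    monitor.record(prediction=10.0, actual=11.0)   # errors of 1.0: healthy
healthy = monitor.needs_retraining()
for _ in range(5):
    monitor.record(prediction=10.0, actual=13.0)   # errors of 3.0: degraded
degraded = monitor.needs_retraining()
print(healthy, degraded)                           # False True
```

This assumes true outcomes eventually become available in production. When they don't, monitoring the input data distribution is a common substitute.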
