Machine Learning Classification: A Dataset-based Pictorial

 


The concept of classification in machine learning is concerned with building a model that separates data into distinct classes. This model is built by inputting a set of training data for which the classes are pre-labeled in order for the algorithm to learn from. The model is then used by inputting a different dataset for which the classes are withheld, allowing the model to predict their class membership based on what it has learned from the training set. Well-known classification schemes include decision trees and Support Vector Machines, among a whole host of others. As this type of algorithm requires explicit class labeling, classification is a form of supervised learning.

This is, conceptually, quite intuitive and easy to understand. But the uninitiated may ask how this plays out in real life. In order to relate machine learning classification to the practical, let’s see how this concept plays out, step by step, specifically in relation to a dataset, as we go from a single comma separated value (CSV) file – a common means of storing and feeding data into a machine learning system – to a model which can be used to make predictions.

In our exercise, we will make the following assumptions:

We will use the time-tested adult dataset for our example We will omit the details of discussing any pre-processing of our dataset As such, we will ignore the presence of categorical features in our dataset, which would need to be converted to numeric representation in real life We will also ignore the presence of null values which would also need to be dealt with in real life Finally, we will assume that we are interested in a conventional train/test split of our dataset (as opposed to some holdout method such as cross-validation)

We start with having a look at our raw dataset, which you will find in Figure 1. This includes all of the data which would be necessary to complete a machine learning task. We have taken only the top 25 lines of the CSV file for our example.

Note that you can click on all images to enlarge for a better look.

Figure 1: Raw adult dataset.

We have a set of characteristic variables, or features, which describe an instance of some observation. Rows are the observations, or instances, while columns are features. This is true of all columns excepts the right-most, which is our target, a set of categories which we will attempt to predict by their relevant feature values – this is the essence of classification. The “other” category of supervised learning, regression, is almost identical, conceptually; the sole difference is that, while prediction is intended to learn how to predict for a finite set of categorical values, regression is intended to predict continuous numeric values.

Figure 2 shows our dataset separated into features to the left of the red line, and targets to the right.

Figure 2: Features are to the left of the red line, targets to the right.

Features for a particular instance are grouped together into a feature vector, an example of which is outlined in Figure 3.

Figure 3: A feature vector shown in the context of a full dataset.

An instance is made up of a feature vector and a corresponding target, as shown in Figure 4.

Figure 4: An instance (or observation) encompasses both a feature vector and a corresponding target.

Now that we have defined what the full dataset is made of, we have some decisions to make. As we eventually want to move to modeling our data and using what is learned to classify other data (not the same data we used to build the model with), we will need to separating the dataset into training and testing datasets. This is usually accomplished by splitting dataset at some point, noted by the percentage of the dataset one would like to use for training. In this example, we will use 20 of our data instances for training (80%, a split which makes generally is in the range of what makes sense), and the remaining 5 data instances for testing what we have learned. This split can be shown in Figure 5.

Figure 5: Splitting our dataset into training (above the yellow line) and testing (below the yellow line) sets.

At this point we know enough about our single entity dataset to slice it up into pieces which will be useful for use in our machine learning algorithm. We will require a separation of features and targets in both our training and testing sets (we will overlook the details of ensuring a random split of instances).

Assuming we are working the Python ecosystem (with which I am familiar), such as split can be easily accomplished with a tool such as Scikit-learn’s train_test_split function, a sample of which is shown below:

After invocation, we are left with:

Training feature matrix (X_train), top left (of Figure 6) Training target vector (y_train), top right Testing feature matrix (X_test), bottom left Testing target vector (y_test), bottom right

Figure 6: Our training and testing splits, along with feature and target separation.

Now we need to learn the mapping of features to targets in our training set, in order to apply this mapping to our testing data to see how accurate our model is. This learning process results, shown conceptually in Figure 7, in a function which can then be used afterward. This is the essence of supervised machine learning: feeding in labeled data instances, learning just such a mapping function, and applying this function to data for which labels are not known (or are intentionally withheld).

Figure 7: Going from raw data to useful predictive function, during modeling.

After training, the function can be applied to our testing set, and predictions can be made based on the features in the testing instances (shown in Figure 8).

Figure 8: Using the learned function to make predictions on the testing set.

The last major step would be to measure the effectiveness of our model, via a metric such as accuracy (note that we did not discuss testing vs. validation sets). Our predicted targets during the testing phase would be compared to the ground truth (actual) targets of the testing set, and noted. In practical terms, this would compare the elements of a new vector (say, y_pred) to the elements of the existing y_test vector. This shows us how effective our model was, and gives us a baseline to which we can compare future classification models (functions) learned using alternative algorithms, or even using the same algorithm with different hyperparameter settings.

 Related: