Understanding Regression Error Metrics

Human brains are built to recognize patterns in the world around us. For example, we observe that if we practice our programming everyday, our related skills grow. But how do we precisely describe this relationship to other people? How can we describe how strong this relationship is? Luckily, we can describe relationships between phenomena, such as practice and skill, in terms of formal mathematical estimations called regressions.

Regressions are one of the most commonly used tools in a data scientist’s kit. When you learn Python or R, you gain the ability to create regressions in single lines of code without having to deal with the underlying mathematical theory. But this ease can cause us to forget to evaluate our regressions to ensure that they are a sufficient enough representation of our data.

We can plug our data back into our regression equation to see if the predicted output matches corresponding observed value seen in the data. The quality of a regression model is how well its predictions match up against actual values, but how do we actually evaluate quality?

Luckily, smart statisticians have developed error metrics to judge the quality of a model and enable us to compare regresssions against other regressions with different parameters. These metrics are short and useful summaries of the quality of our data. This article will dive into four common regression metrics and discuss their use cases.

There are many types of regression, but this article will focus exclusively on metrics related to the linear regression. The linear regression is the most commonly used model in research and business and is the simplest to understand, so it makes sense to start developing your intuition on how they are assessed. The intuition behind many of the metrics we’ll cover here extend to other types of models and their respective metrics.

If you’d like a quick refresher on the linear regression, you can consult this fantastic blog post or the Linear Regression Wiki page.

In the context of regression, models refer to mathematical equations used to describe the relationship between two variables. In general, these models deal with prediction and estimation of values of interest in our data called outputs. Models will look at other aspects of the data called inputs that we believe to affect the outputs, and use them to generate estimated outputs. These inputs and outputs have many names that you may have heard before. Inputs are can also be called independent variables or predictors, while outputs are also known as responses or dependent variables. Simply speaking, models are just functions where the outputs are some function of the inputs.

The linear part of linear regression refers to the fact that a linear regression model is described mathematically in the form:

If that looks too mathematical, take solace in that linear thinking is particularly intuitive. If you’ve ever heard of “practice makes perfect,” then you know that more practice means better skills; there is some linear relationship between practice and perfection.

The regression part of linear regression does not refer to some return to a lesser state. Regression here simply refers to the act of estimating the relationship between our inputs and outputs. In particular, regression deals with the modelling of continuous values (think: numbers) as opposed to discrete states (think: categories).

Taken together, a linear regression creates a model that assumes a linear relationship between the inputs and outputs. The higher the inputs are, the higher (or lower, if the relationship was negative) the outputs are.

What adjusts how strong the relationship is and what the direction of this relationship is between the inputs and outputs are our coefficients. The first coefficient without an input is called the intercept, and it adjusts what the model predicts when all your inputs are 0.

We will not delve into how these coefficients are calculated, but know that there exists a method to calculate the optimal coefficients, given which inputs we want to use to predict the output. Given the coefficients, if we plug in values for the inputs, the linear regression will give us an estimate for what the output should be.

As we’ll see, these outputs won’t always be perfect. Unless our data is a perfectly straight line, our model will not precisely hit all of our data points. One of the reasons for this is the ϵ (named “epsilon”) term. This term represents error that comes from sources out of our control, causing the data to deviate slightly from their true position.

Our error metrics will be able to judge the differences between prediction and actual values, but we cannot know how much the error has contributed to the discrepancy. While we cannot ever completely eliminate epsilon, it is useful to retain a term for it in a linear model.

Since our model will produce an output given any input or set of inputs, we can then check these estimated outputs against the actual values that we tried to predict. We call the difference between the actual value and the model’s estimate a residual. We can calculate the residual for every point in our data set, and each of these residuals will be of use in assessment. These residuals will play a significant role in judging the usefulness of a model.

If our collection of residuals are small, it implies that the model that produced them does a good job at predicting our output of interest. Conversely, if these residuals are generally large, it implies that model is a poor estimator.

We technically can inspect all of the residuals to judge the model’s accuracy, but unsurprisingly, this does not scale if we have thousands or millions of data points. Thus, statisticians have developed summary measurements that take our collection of residuals and condense them into a single value that represents the predictive ability of our model.

There are many of these summary statistics, each with their own advantages and pitfalls. For each, we’ll discuss what each statistic represents, their intution and typical use case. We’ll cover:

Mean Absolute Error
Mean Square Error
Mean Absolute Percentage Error
Mean Percentage Error

Note: Even though you see the word error here, it does not refer to the epsilon term from above! The error described in these metrics refer to the residuals!

In discussing these error metrics, it is easy to get bogged down by the various acronyms and equations used to describe them. To keep ourselves grounded, we’ll use a model that I’ve created using the Video Game Sales Data Set from Kaggle.

The specifics of the model I’ve created are shown below.

My regression model takes in two inputs (critic score and user score), so it is a multiple variable linear regression. The model took in my data and found that 0.039 and -0.099 were the best coefficients for the inputs. For my model, I chose my intercept to be zero since I’d like to imagine there’d be zero sales for scores of zero. Thus, the intercept term is crossed out. Finally, the error term is crossed out because we do not know its true value in practice. I have shown it because it depicts a more detailed description of what information is encoded in the linear regression equation.

Rationale behind the model

Let’s say that I’m a game developer who just created a new game, and I want to know how much money I will make. I don’t want to wait, so I developed a model that predicts total global sales (my output) based on an expert critic’s judgment of the game and general player judgment (my inputs). If both critics and players love the game, then I should make more money… right? When I actually get the critic and user reviews for my game, I can predict how much glorious money I’ll make.

Currently, I don’t know if my model is accurate or not, so I need to calculate my error metrics to check if I should perhaps include more inputs or if my model is even any good!

The mean absolute error (MAE) is the simplest regression error metric to understand. We’ll calculate the residual for every data point, taking only the absolute value of each so that negative and positive residuals do not cancel out. We then take the average of all these residuals. Effectively, MAE describes the typical magnitude of the residuals. If you’re unfamiliar with the mean, you can refer back to this article on descriptive statistics. The formal equation is shown below:

The picture below is a graphical description of the MAE. The green line represents our model’s predictions, and the blue points represent our data.

The MAE is also the most intuitive of the metrics since we’re just looking at the absolute difference between the data and the model’s predictions. Because we use the absolute value of the residual, the MAE does not indicate underperformance or overperformance of the model (whether or not the model under or overshoots actual data). Each residual contributes proportionally to the total amount of error, meaning that larger errors will contribute linearly to the overall error.

Like we’ve said above, a small MAE suggests the model is great at prediction, while a large MAE suggests that your model may have trouble in certain areas. A MAE of 0 means that your model is a perfect predictor of the ouputs (but this will almost never happen).

While the MAE is easily interpretable, using the absolute value of the residual often is not as desirable as squaring this difference. Depending on how you want your model to treat outliers, or extreme values, in your data, you may want to bring more attention to these outliers or downplay them. The issue of outliers can play a major role in which error metric you use.

Calculating MAE against our model

Calculating MAE is relatively straightforward in Python. In the code below, sales contains a list of all the sales numbers, and X contains a list of tuples of size 2. Each tuple contains the critic score and user score corresponding to the sale in the same index. The lm contains a LinearRegression object from scikit-learn, which I used to create the model itself. This object also contains the coefficients. The predict method takes in inputs and gives the actual prediction based off those inputs.

# Perform the intial fitting to get the LinearRegression object
from sklearn import linear_model
lm = linear_model.LinearRegression()
lm.fit(X, sales)

mae_sum = 0
for sale, x in zip(sales, X):
 prediction = lm.predict(x)
 mae_sum += abs(sale - prediction)
mae = mae_sum / len(sales)

print(mae)
>>> [ 0.7602603 ]

Our model’s MAE is 0.760, which is fairly small given that our data’s sales range from 0.01 to about 83 (in millions).

The mean square error (MSE) is just like the MAE, but squares the difference before summing them all instead of using the absolute value. We can see this difference in the equation below.

Consequences of the Square Term

Because we are squaring the difference, the MSE will almost always be bigger than the MAE. For this reason, we cannot directly compare the MAE to the MSE. We can only compare our model’s error metrics to those of a competing model.

The effect of the square term in the MSE equation is most apparent with the presence of outliers in our data. While each residual in MAE contributes proportionally to the total error, the error grows quadratically in MSE. This ultimately means that outliers in our data will contribute to much higher total error in the MSE than they would the MAE. Similarly, our model will be penalized more for making predictions that differ greatly from the corresponding actual value. This is to say that large differences between actual and predicted are punished more in MSE than in MAE. The following picture graphically demonstrates what an indivdual residual in the MSE might look like.

Outliers will produce these exponentially larger differences, and it is our job to judge how we should approach them.

The problem of outliers

Outliers in our data are a constant source of discussion for the data scientists that try to create models. Do we include the outliers in our model creation or do we ignore them? The answer to this question is dependent on the field of study, the data set on hand and the consequences of having errors in the first place.

For example, I know that some video games achieve a superstar status and thus have disproportionately higher earnings. Therefore, it would be foolish of me to ignore these outlier games because they represent a real phenomenon within the data set. I would want to use the MSE to ensure that my model takes these outliers into account more. If I wanted to downplay their significance, I would use the MAE since the outlier residuals won’t contribute as much to the total error as MSE.

Ultimately, the choice between is MSE and MAE is application-specific and depends on how you want to treat large errors. Both are still viable error metrics, but will describe different nuances about the prediction errors of your model.

A note on MSE and a close relative

Another error metric you may encounter is the root mean squared error (RMSE). As the name suggests, it is the square root of the MSE. Because the MSE is squared, its units do not match that of the original output. Researchers will often use RMSE to convert the error metric back into similar units, making interpretation easier.

Since the MSE and RMSE both square the residual, they are similarly affected by outliers. The RMSE is analogous to the standard deviation (MSE to variance) and is a measure of how large your residuals are spread out.

Both MAE and MSE can range from 0 to positive infinity, so as both of these measures get higher, it becomes harder to interpret how well your model is performing. Another way we can summarize our collection of residuals is by using percentages so that each prediction is scaled against the value it’s supposed to estimate.

Calculating MSE against our model

Like MAE, we’ll calculate the MSE for our model. Thankfully, the calculation is just as simple as MAE.

mse_sum = 0
for sale, x in zip(sales, X):
 prediction = lm.predict(x)
 mse_sum += (sale - prediction)**2
mse = mse_sum / len(sales)

print(mse)
>>> [ 3.53926581 ]

With the MSE, we would expect it to be much larger than MAE due to the influence of outliers. We find that this is the case: the MSE is an order of magnitude higher than the MAE. The corresponding RMSE would be about 1.88, indicating that our model misses actual sale values by about $1.8M.

The mean absolute percentage error (MAPE) is the percentage equivalent of MAE. The equation looks just like that of MAE, but with adjustments to convert everything into percentages.

Just as MAE is the average magnitude of error produced by your model, the MAPE is how far the model’s predictions are off from their corresponding outputs on average. Like MAE, MAPE also has a clear interpretation since percentages are easier for people to conceptualize. Both MAPE and MAE are robust to the effects of outliers thanks to the use of absolute value.

However for all of its advantages, we are more limited in using MAPE than we are MAE. Many of MAPE’s weaknesses actually stem from use division operation. Now that we have to scale everything by the actual value, MAPE is undefined for data points where the value is 0. Similarly, the MAPE can grow unexpectedly large if the actual values are exceptionally small themselves. Finally, the MAPE is biased towards predictions that are systematically less than the actual values themselves. That is to say, MAPE will be lower when the prediction is lower than the actual compared to a prediction that is higher by the same amount. The quick calculation below demonstrates this point.

We have a measure similar to MAPE in the form of the mean percentage error. While the absolute value in MAPE eliminates any negative values, the mean percentage error incorporates both positive and negative errors into its calculation.

Calculating MAPE against our model

mape_sum = 0
for sale, x in zip(sales, X):
 prediction = lm.predict(x)
 mape_sum += (abs((sale - prediction))/sale)
mape = mape_sum/len(sales)

print(mape)
>>> [ 5.68377867 ]

We know for sure that there are no data points for which there are zero sales, so we are safe to use MAPE. Remember that we must interpret it in terms of percentage points. MAPE states that our model’s predictions are, on average, 5.6% off from actual value.

The mean percentage error (MPE) equation is exactly like that of MAPE. The only difference is that it lacks the absolute value operation.

Even though the MPE lacks the absolute value operation, it is actually its absence that makes MPE useful. Since positive and negative errors will cancel out, we cannot make any statements about how well the model predictions perform overall. However, if there are more negative or positive errors, this bias will show up in the MPE. Unlike MAE and MAPE, MPE is useful to us because it allows us to see if our model systematically underestimates (more negative error) or overestimates (positive error).

If you’re going to use a relative measure of error like MAPE or MPE rather than an absolute measure of error like MAE or MSE, you’ll most likely use MAPE. MAPE has the advantage of being easily interpretable, but you must be wary of data that will work against the calculation (i.e. zeroes). You can’t use MPE in the same way as MAPE, but it can tell you about systematic errors that your model makes.

Calculating MPE against our model

mpe_sum = 0
for sale, x in zip(sales, X):
 prediction = lm.predict(x)
 mpe_sum += ((sale - prediction)/sale)
mpe = mpe_sum/len(sales)

print(mpe)
>>> [-4.77081497]

All the other error metrics have suggested to us that, in general, the model did a fair job at predicting sales based off of critic and user score. However, the MPE indicates to us that it actually systematically underestimates the sales. Knowing this aspect about our model is helpful to us since it allows us to look back at the data and reiterate on which inputs to include that may improve our metrics.

Overall, I would say that my assumptions in predicting sales was a good start. The error metrics revealed trends that would have been unclear or unseen otherwise.

We’ve covered a lot of ground with the four summary statistics, but remembering them all correctly can be confusing. The table below will give a quick summary of the acronyms and their basic characteristics.

Acroynm	Full Name	Residual Operation?	Robust To Outliers?
MAE	Mean Absolute Error	Absolute Value	Yes
MSE	Mean Squared Error	Square	No
RMSE	Root Mean Squared Error	Square	No
MAPE	Mean Absolute Percentage Error	Absolute Value	Yes
MPE	Mean Percentage Error	N/A	Yes

All of the above measures deal directly with the residuals produced by our model. For each of them, we use the magnitude of the metric to decide if the model is performing well. Small error metric values point to good predictive abillity, while large values suggest otherwise. That being said, it’s important to consider the nature of your data set in choosing which metric to present. Outliers may change your choice in metric, depending on if you’d like to give them more significance to the total error. Some fields may just be more prone to outliers, while others are may not see them so much.

In any field though, having a good idea of what metrics are available to you is always important. We’ve covered a few of the most common error metrics used, but there are others that also see use. The metrics we covered use the mean of the residuals, but the median residual also sees use. As you learn other types of models for your data, remember that intution we developed behind our metrics and apply them as needed.

If you’d like to explore the linear regression more, Dataquest offers an excellent course on its use and application! We used scikit-learn to apply the error metrics in this article, so you can read the docs to get a better look at how to use them!