Metric Mania: Demystifying ML Model Evaluation – Regression

Sometimes the hardest part of choosing a model is deciding which metrics matter to you. They can tell conflicting stories, especially if you’re unsure what they’re trying to tell you!

This is the second part of my two-part Metric Mania blog series. The first part covered Classification; in this post I will go through the main metrics for Regression models, what they tell you and how to use them.

Regression

In direct contrast with Classification problems, where we predict distinct values or categories, predictions for Regression problems are made on a continuous numeric scale. Hypothetically, values can range from -999,999,999 to 999,999,999 and beyond in either direction. Because of this, the metrics we use are mathematical in nature, ranging from simple arithmetic to something more involved. We will highlight two quite rudimentary but powerful metrics, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), as well as one metric that is based on more complex mathematics but is easy to interpret as a value and is probably the most useful of the three: R².

MAE

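Mean Absolute Error is the simplest metric to calculate. In its standard form it is the average of the absolute differences between the true and predicted values:

Mean of |True - Predicted|

If you drop the absolute value and average the signed differences, Mean of (True - Predicted), positive and negative errors cancel each other out, and the result is strictly the mean error (or bias) rather than MAE. That signed average is what the discussion here is based on. The above graphical summary of MAE shows its true value in terms of how your predictions are spread around perfect (perfect being the central dotted line where predicted = actual). The closer to zero it is, the more evenly the differences fall above and below perfect. The main problem with the signed average is that you don’t get a value for the magnitude of the differences, but you do with Root Mean Squared Error.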
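To make the calculation concrete, here is a minimal Python sketch. The four true/predicted pairs are hypothetical values, picked so that the signed average comes out at 0.5; they are not the data behind the graphic, and the function names are my own rather than from any library. It computes both the signed average of (True - Predicted) and the standard MAE, which takes the absolute value of each difference first.

```python
# Hypothetical true and predicted values (not the data behind the graphic).
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [7.0, 10.0, 16.0, 17.0]


def mean_error(true, pred):
    """Average of the signed differences (True - Predicted); errors can cancel."""
    return sum(t - p for t, p in zip(true, pred)) / len(true)


def mean_absolute_error(true, pred):
    """Average of the absolute differences |True - Predicted| (standard MAE)."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


print(mean_error(y_true, y_pred))           # 0.5
print(mean_absolute_error(y_true, y_pred))  # 2.0
```

The gap between the two numbers is the point: the signed average says the predictions are balanced around perfect, while the absolute version says each prediction is off by 2 on average.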

RMSE

Root Mean Squared Error is nearly identical to MAE, except for a couple of extra steps in the calculation: we square the differences before averaging them, and then take the square root of the result:

Square root of the mean of (True - Predicted) squared
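As a rough sketch of those steps in Python, assuming a handful of hypothetical true/predicted values (picked so that the result comes out at 2.12, the value used in this section’s example) and a function name of my own choosing:

```python
import math

# Hypothetical true and predicted values (not the data behind the graphic).
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [7.0, 10.0, 16.0, 17.0]


def root_mean_squared_error(true, pred):
    """Square each difference, average the squares, then take the square root."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true, pred)]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_squared_error)


print(round(root_mean_squared_error(y_true, y_pred), 2))  # 2.12
```

Squaring removes the sign of each difference, so errors above and below the true value can no longer cancel each other out.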

Using the same values as the graphical example for MAE, we get the above value of 2.12 for RMSE. Because we have squared and then rooted the differences, their signs are ignored, so we see how far off perfect we are on average, whether above or below the true values. It is also interesting to see how the two metrics deal with extreme values, an example of which is shown below.

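Adding to the original graphical representation, we have placed two equal extreme values of 12 either side of perfect. We see that this in fact improves the MAE figure and rightfully makes RMSE worse. On average the predictions are further away from perfect than before, but we now have a more even spread above and below perfect. Note that the MAE figure only improves because it is calculated here from signed differences, so the two new errors cancel out; an MAE calculated from absolute differences would get worse as well, since each new error counts as 12 whatever its sign.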
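The effect of the two extreme values can be reproduced with a small Python sketch. The starting errors are hypothetical (chosen to give a signed average of 0.5 and an RMSE of 2.12), and the helper names are my own. It compares the signed average, the standard absolute MAE and RMSE before and after adding errors of +12 and -12.

```python
import math

# Hypothetical (True - Predicted) differences, then the same with two extremes.
errors = [3.0, 2.0, -2.0, -1.0]
errors_with_extremes = errors + [12.0, -12.0]


def signed_mean(errs):
    """Average of the signed differences; +12 and -12 cancel each other out."""
    return sum(errs) / len(errs)


def mae(errs):
    """Average of the absolute differences (standard MAE)."""
    return sum(abs(e) for e in errs) / len(errs)


def rmse(errs):
    """Square root of the average squared difference."""
    return math.sqrt(sum(e ** 2 for e in errs) / len(errs))


for label, errs in [("original", errors), ("with extremes", errors_with_extremes)]:
    print(f"{label}: signed mean={signed_mean(errs):.2f}, "
          f"MAE={mae(errs):.2f}, RMSE={rmse(errs):.2f}")
# original: signed mean=0.50, MAE=2.00, RMSE=2.12
# with extremes: signed mean=0.33, MAE=5.33, RMSE=7.14
```

Only the signed average moves closer to zero. Both the absolute MAE and RMSE get worse, and RMSE reacts more strongly here because squaring gives the two large errors extra weight.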

R²

In essence, R² tells us how well a model fits, on a range from 0 (poor fit) to 1 (perfect fit); this can be treated as a percentage if you wish (0-100%). (Strictly, it can drop below 0 when a model does worse than simply predicting the mean every time.) What it is really telling us is how much of the variation in the data the model was trained on is explained by the model. We won’t go into the mathematics of how the explained variation is worked out; we just need to know that it is, and that it’s quite a powerful metric. I say it’s powerful because it’s much easier to understand than the metrics we have already discussed, for two reasons:

It’s unit free
It will never be a large number that is hard to relate to
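For anyone curious about what sits behind the number, here is a minimal Python sketch using one common definition of R²: 1 minus the ratio of the squared prediction errors to the squared differences from the mean of the true values. The data and the function name are hypothetical, chosen only for illustration.

```python
def r_squared(true, pred):
    """1 - (sum of squared errors) / (sum of squared differences from the mean)."""
    mean_true = sum(true) / len(true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_total = sum((t - mean_true) ** 2 for t in true)
    return 1 - ss_residual / ss_total


# Hypothetical true values and three sets of predictions.
y_true = [10.0, 12.0, 14.0, 16.0]
close_fit = [10.5, 11.5, 14.5, 15.5]
mean_only = [13.0, 13.0, 13.0, 13.0]   # always predicts the mean of y_true
reversed_fit = [16.0, 14.0, 12.0, 10.0]

print(round(r_squared(y_true, y_true), 2))        # 1.0
print(round(r_squared(y_true, close_fit), 2))     # 0.95
print(round(r_squared(y_true, mean_only), 2))     # 0.0
print(round(r_squared(y_true, reversed_fit), 2))  # -3.0
```

Because the value is a ratio of two quantities measured in the same squared units, the units cancel, which is why R² reads the same whatever scale the predictions are on.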

To put into context why this is important, take the following example. We’ve built a model to predict the time taken to complete a project. We get an MAE of 0.5 and an RMSE of 2.12, as in the examples above. How happy we are with these numbers depends on the unit of time we are predicting. If it’s seconds, we’re pretty happy. But days, or maybe weeks? That calls the appropriateness of the model into question. And although choosing appropriate units helps, working with large numbers can still skew our perception of MAE and RMSE. If our predictions are in the region of 1 million and we get an RMSE of 10,000, it may look bad, but really it’s about a 1% error. The other metrics simply take more thought, while R² is understandable at a glance.
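The unit dependence is easy to see in a short Python sketch. The project durations are hypothetical, the helper names are my own, and R² is computed with the common 1 - (squared errors / squared differences from the mean) definition. The same predictions are scored in days and then in hours, followed by the 1% arithmetic from the example.

```python
import math


def rmse(true, pred):
    """Square root of the average squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def r_squared(true, pred):
    """1 - (sum of squared errors) / (sum of squared differences from the mean)."""
    mean_true = sum(true) / len(true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_total = sum((t - mean_true) ** 2 for t in true)
    return 1 - ss_residual / ss_total


# Hypothetical project durations in days, then the same values in hours.
days_true = [10.0, 12.0, 14.0, 16.0]
days_pred = [7.0, 10.0, 16.0, 17.0]
hours_true = [d * 24 for d in days_true]
hours_pred = [d * 24 for d in days_pred]

print(f"RMSE in days:  {rmse(days_true, days_pred):.2f}")    # 2.12
print(f"RMSE in hours: {rmse(hours_true, hours_pred):.2f}")  # 50.91
print(f"R² in days:    {r_squared(days_true, days_pred):.2f}")
print(f"R² in hours:   {r_squared(hours_true, hours_pred):.2f}")

# RMSE relative to the size of the predictions, as in the 1 million example.
relative_error = 10_000 / 1_000_000
print(f"Relative error: {relative_error:.0%}")  # 1%
```

RMSE changes by a factor of 24 when the unit changes, while R² stays the same, so only the first needs the extra step of asking what scale the predictions are on.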