Measurement Metrics for ML model evaluation

Mithun Ghosh
10 min read · Mar 3, 2021
A model measure should be fit for purpose: one has to choose the right metric that caters to the problem.

In today’s world we have an abundance of data being generated by various systems. These data are the footprints of each system’s performance or character. We would like to predict or monitor the data-generating system (or its performance) using machine learning (ML) models. These models have their background in statistics and probability theory, which provide a solid theoretical framework for learning patterns from data. But wait, how do we know we are learning the ‘right’ or ‘correct’ pattern? When making an assessment between various model hypotheses and their trained instantiations, how do you know which one performs ‘best’ for the use case? This is where metrics come in. Like any scientific learning, ML goes through rigorous validation checks to make sure the inductive learning made by the model is actually good. Since there is no perfect model, but some models are useful, we need to decide which models are useful based on the measurement criteria I explain below.

The usual task of a machine learning model is to learn a target Y given a set of X variables, i.e., Y = f(X). X and Y are one or more vectors of observations. In the X matrix [xij], each row (subscript i) is one observation across variables and each column (subscript j) refers to one variable or measurement. The corresponding y is the target value being modelled. The dataset has n values y1,…,yn (collectively known as yi, or as a vector Y = [y1,…,yn].transpose()), each associated with a fitted (or modelled, or predicted) value f1,…,fn (known as fi, or sometimes ŷi, collectively the vector f). We get these fitted values by applying the ML hypothesis function to X.

There are generally three broad kinds of machine learning problems:

1. Regression: When the output (Y) is continuous

2. Classification: When the output (Y) is a nominal or class variable (or variables)

3. Reinforcement learning: When the model makes decisions based on a policy and the state of the system. We shall not cover its metrics here; I plan to write a separate article on RL measurement metrics.

1. Regression Problems:

1. R Square/Adjusted R Square

2. Mean Square Error(MSE)/Root Mean Square Error(RMSE)

3. Mean Absolute Error(MAE)

4. Mean Absolute Percent Error(MAPE)

R Square/Adjusted R Square: R-squared (also called the coefficient of determination) measures how much of the variability in the target can be explained by the independent variables. In fact, Total variance in the data = Variance explained by the regression + Residual (unexplained) variance.

R² = 1 − MSE/var(y) = 1 − SSres/SStot = 1 − (SStot − SSreg)/SStot = SSreg/SStot

MSE is the mean squared error of the residuals (yi − fi), where fi = h(xi) and h is the model.

R² = 1 − Σᵢ (yi − fi)² / Σᵢ (yi − ȳ)²

We can see that if we keep adding variables, R-squared will never decrease, which can give a false notion that the model is improving while we are actually increasing variance and making the model more complex. To safeguard against this we use adjusted R-squared, which penalises the metric for every new variable added to the model, allowing us to evaluate the regression model more objectively. R-squared lies between 0 and 1 (adjusted R-squared can even dip below 0 for a very poor fit); 1 means perfect determination, and higher values indicate a better model fit.

Adjusted R² = 1 − [SSres/(n − p − 1)] / [SStot/(n − 1)] = 1 − (1 − R²) · (n − 1)/(n − p − 1)

SSres is the residual sum of squares, SStot the total sum of squares, n the total number of observations in the dataset and p the number of explanatory variables used in the model. For a fixed R-squared, increasing p reduces adjusted R-squared, so adding more variables is penalised, which keeps model variance in check.

from sklearn.metrics import r2_score

# R-squared on the hold-out set
r_square = r2_score(y_test, y_pred)

# Adjusted R-squared: n = number of test observations, p = number of features
n, p = X_test.shape  # assumes X_test is a 2-D array or DataFrame
adjusted_r2 = 1 - (1 - r_square) * (n - 1) / (n - p - 1)

In modern regression problems where we have millions of observations and relatively few variables, i.e. n/p >> 1 so that (n − 1)/(n − p − 1) → 1 (for example, n = 1,000,000 and p = 20 gives a factor of about 1.00002), both measures yield almost identical values.

Mean Square Error(MSE)/Root Mean Square Error(RMSE): MSE is another way to evaluate model performance directly; in fact it already appears in the R-squared and adjusted R-squared expressions above.

MSE = (1/n) · Σᵢ (yi − fi)²

Taking the square root of this measure gives the root mean squared error (RMSE). Because the error is averaged over the number of observations, it does not grow simply because we add more observations, but its value can be any non-negative real number. If we standardise the y variable, the value stays roughly within (0, 1) for any model that beats the naive mean prediction.

from sklearn.metrics import mean_squared_error
import math

mse = mean_squared_error(Y_test, Y_predicted)
rmse = math.sqrt(mse)  # RMSE is simply the square root of MSE

Mean Absolute Error(MAE): If, instead of squaring the prediction error, we take its absolute value (the difference without sign) and average it, the resulting metric is called MAE.

MAE = (1/n) · Σᵢ |yi − fi|

MSE puts a larger penalty on large prediction errors, so when y is not standardised and big mistakes are especially costly, MSE may be a better metric than MAE; MAE, in turn, is less sensitive to outliers.

from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(Y_test, Y_predicted)
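A tiny numeric illustration of that penalty difference, using toy numbers (not from the article): four small errors of 1 and one big error of 4 give the same MAE, but MSE is four times larger for the big miss.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 10.0, 10.0, 10.0])
y_small_errors = np.array([9.0, 11.0, 9.0, 11.0])     # four errors of 1
y_one_big_error = np.array([10.0, 10.0, 10.0, 14.0])  # one error of 4

mean_absolute_error(y_true, y_small_errors), mean_absolute_error(y_true, y_one_big_error)  # 1.0 vs 1.0
mean_squared_error(y_true, y_small_errors), mean_squared_error(y_true, y_one_big_error)    # 1.0 vs 4.0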

Mean Absolute Percent Error(MAPE): MAPE calculates the mean absolute percentage error (deviation) between the forecast and the eventual outcomes; it is also referred to as MAPD.

This is used for time series prediction problems.

MAPE = (100/N) · Σᵢ |xi − x̂i| / |xi|, where

{xi} is the actual observation time series,

{x̂i} is the estimated or forecasted time series, and

N is the number of non-missing data points.
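A minimal sketch using scikit-learn's built-in helper (available from scikit-learn 0.24 onward; note that it returns a fraction rather than a percentage, and the variable names follow the earlier examples):

from sklearn.metrics import mean_absolute_percentage_error  # scikit-learn >= 0.24

# Returned as a fraction, e.g. 0.12 means a 12% average absolute error
mape = mean_absolute_percentage_error(Y_test, Y_predicted)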

2. Classification Problems:

Classification is a very important class of problems in ML and pattern recognition. We will discuss its metrics in detail, along with comparisons between the concepts and their relative trade-offs. This will help us pick the right metric for each use case.

1. Confusion Matrix

2. Accuracy, Recall, Precision

3. F1 Score

4. Log Loss

5. ROC AUC & PR AUC

6. Concordance and Discordance

7. Kolmogorov Smirnov Statistic

8. ROC AUC vs Accuracy

9. F1 score vs Accuracy

10. PR AUC vs ROC AUC

11. F1 score vs ROC AUC

1. Confusion Matrix: It is a cross-tabulation of the model's predicted outcomes against the actual values. The name may sound 'confusing', but hold on, it is quite intuitive to understand.
Confusion matrix (rows: actual class, columns: predicted class)

                   Predicted Positive    Predicted Negative
Actual Positive    TP                    FN
Actual Negative    FP                    TN

TP — True Positive

FP — False Positive

FN — False Negative

TN — True Negative

The table is filled with the counts of each of these four outcomes. TP and TN are the correct predictions: TP is when the model 'accurately' predicts positive and TN when it 'accurately' predicts negative. FP is when the model 'inaccurately' predicts positive but the actual class is negative, and FN is when an actually positive case is 'inaccurately' predicted as negative.

#Calculate confusion matrix
from sklearn.metrics import confusion_matrix
matrix=confusion_matrix(y_test, y_predicted)

2. Accuracy, Recall and Precision: From the above table we can quickly infer that Accuracy = (TP+TN)/(TP+TN+FP+FN). Accuracy is a good measure when you do not have class imbalance, i.e., no class is heavily under- or over-represented. For rare-event problems, however, it can be misleading. For example, if the event of interest occurs only 0.5% of the time (actual class 1: 0.5%, class 0: 99.5%) and we build a naive model that predicts 0 for everything, our accuracy will be 99.5%, which is misleading.

Recall = TP/(TP+FN). This measure is used when we aim to capture the maximum number of positive cases. The denominator is the number of actual positives, so it tells us, out of all positives, how many the model is able to label correctly. In the earlier example, the recall of our naive model is 0.

Precision = TP/(TP+FP). This measures, out of all the predicted positives, how many are correct. Our naive model again scores 0 here (strictly speaking it is undefined, since nothing was predicted positive). There is usually a trade-off with recall: pushing precision up often ends up reducing recall, so the two should be optimised together. We shall see various ways to do so in the following discussion.

#Calculate Accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_predicted)
#Calculate Recall
from sklearn.metrics import recall_score
recall_score(y_test, y_predicted)
#Calculate Precision
from sklearn.metrics import precision_score
precision_score(y_test, y_predicted)
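To see the rare-event problem concretely, here is a quick simulated check (a hypothetical example with a 0.5% event rate matching the discussion above, not data from the article):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true_rare = (rng.random(100_000) < 0.005).astype(int)  # ~0.5% positives
y_naive = np.zeros_like(y_true_rare)                      # naive model: predict 0 for everything

accuracy_score(y_true_rare, y_naive)   # ≈ 0.995, looks great
recall_score(y_true_rare, y_naive)     # 0.0, the model misses every positive case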

3. F1 Score: One straightforward way to optimise precision and recall together is to take their harmonic mean (HM). This HM of precision and recall is called the F1 score: F1 = 2 · (precision · recall)/(precision + recall).

#calculate F1 score
from sklearn.metrics import f1_score
f1_score(y_test,y_predicted)
Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

F-beta is a generalisation of the F1 score that allows a weighted harmonic combination rather than the plain harmonic mean; setting β² = 1 recovers the F1 score.
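As a quick illustration, scikit-learn exposes this directly through fbeta_score (the beta values below are just examples):

from sklearn.metrics import fbeta_score

f2 = fbeta_score(y_test, y_predicted, beta=2)     # beta > 1 weights recall more heavily
f05 = fbeta_score(y_test, y_predicted, beta=0.5)  # beta < 1 weights precision more heavily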

4. Log Loss/ Binary Cross Entropy: Rather than only counting misclassified labels, log loss also takes the predicted probability behind each classification into account.

Log loss = −(1/n) · Σᵢ [ yi · log(pi) + (1 − yi) · log(1 − pi) ], where pi is the predicted probability that observation i belongs to class 1.

We can see that the log loss value can range from 0 to infinity. If one assigns every observation a probability of 0.5 (a coin-flip prediction), the log loss comes out to about 0.69 (ln 2), so it is imperative to push this value at least below that baseline during the ML training process.

# Calculate log loss (y_score holds predicted probabilities, not hard labels)
from sklearn.metrics import log_loss
log_loss(y_test, y_score)
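As a quick sanity check of that 0.69 baseline, assuming y_test holds binary 0/1 labels: a model that predicts probability 0.5 for everything scores exactly ln(2) ≈ 0.693.

import numpy as np
from sklearn.metrics import log_loss

# Constant 0.5 predictions score ln(2) ≈ 0.693 regardless of the class balance
baseline = log_loss(y_test, np.full(len(y_test), 0.5))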

5. ROC AUC & PR AUC: The ROC curve is a chart that visualizes the trade-off between the true positive rate (TPR) and the false positive rate (FPR), and ROC AUC is the area under that curve. TPR = Recall = TP/(TP+FN) and FPR = FP/(FP+TN). This metric is relatively insensitive to class imbalance.

Also, sensitivity = Recall = TPR, and specificity = TN/(TN+FP) = 1 − FPR. So in some cases you might see the y-axis of the ROC curve labelled sensitivity and the x-axis labelled (1 − specificity).

ROC AUC curve

We take various probability cut-off thresholds and compute the TPR and FPR at each; plotting these points gives the curve. We pick the threshold that raises TPR without raising FPR too much. A high area under the curve (AUC), i.e. the area below the ROC curve, signifies a good fit. The value ranges from 0 to 1, and a random model has an AUC of 0.5.
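A small sketch of that threshold sweep with scikit-learn's roc_curve, assuming y_true holds the actual labels and y_pred_pos the predicted probabilities of the positive class:

import numpy as np
from sklearn.metrics import roc_curve

# fpr, tpr and thresholds are aligned arrays, one entry per candidate cut-off
fpr, tpr, thresholds = roc_curve(y_true, y_pred_pos)

# One common heuristic (Youden's J): pick the threshold maximising TPR - FPR
best_threshold = thresholds[np.argmax(tpr - fpr)]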

PR AUC curve

Similar to ROC AUC, we can define the precision-recall (PR) curve and its area, PR AUC. This curve shows precision (PPV) and recall (TPR) together in a single visualization. PPV = TP/(TP+FP) and TPR = TP/(TP+FN).

Carefully examining this curve lets us pick a threshold that balances precision and recall, typically just before the point where recall starts falling sharply. PR AUC is commonly approximated by average precision, which averages the precision achieved at each recall threshold.

# y_pred_pos holds the predicted probabilities of the positive class
from sklearn.metrics import roc_auc_score, average_precision_score

roc_auc = roc_auc_score(y_true, y_pred_pos)
pr_auc = average_precision_score(y_true, y_pred_pos)
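To actually inspect candidate thresholds on the PR curve, a small sketch with precision_recall_curve (the 0.8 precision target below is only an illustrative choice):

from sklearn.metrics import precision_recall_curve

# precision and recall have one more entry than thresholds (the final point has no threshold)
precision, recall, thresholds = precision_recall_curve(y_true, y_pred_pos)

# Illustrative rule: lowest threshold that still achieves at least 0.8 precision
mask = precision[:-1] >= 0.8
chosen_threshold = thresholds[mask][0] if mask.any() else None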

6. Concordance and Discordance: Ideally, every probability the model assigns to an actual 1 should be higher than every probability it assigns to an actual 0. If we form all pairs consisting of one actual 1 and one actual 0, the pairs that satisfy this property are 'concordant', the pairs where the probability predicted for the actual 1 is lower than that for the actual 0 are 'discordant', and pairs with equal probabilities are 'tied'. A perfect model would have 100% concordance, i.e., all pairs concordant. Higher concordance means a better fit.
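A naive pairwise sketch of that count (O(n1 · n0), so only suitable for modest sample sizes; it assumes y_true is binary, y_pred_pos holds predicted probabilities, and both classes are present):

import numpy as np

def concordance(y_true, y_pred_pos):
    """Fractions of concordant, discordant and tied (event, non-event) pairs."""
    y_true, y_pred_pos = np.asarray(y_true), np.asarray(y_pred_pos)
    pos = y_pred_pos[y_true == 1]   # probabilities assigned to actual 1's
    neg = y_pred_pos[y_true == 0]   # probabilities assigned to actual 0's
    pairs = len(pos) * len(neg)
    conc = sum(p > n for p in pos for n in neg)
    ties = sum(p == n for p in pos for n in neg)
    return conc / pairs, (pairs - conc - ties) / pairs, ties / pairs

Incidentally, ROC AUC equals the concordant fraction plus half of the tied fraction, which is why the two measures move together.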

7. Kolmogorov Smirnov Statistic: This statistic lets us pick the population percentile threshold that maximizes the separation between predicted 1's and 0's, which is widely used in campaign optimisation. The KS statistic is computed as the maximum difference between the cumulative percentage of 1's (cumulative true positive rate) and the cumulative percentage of 0's (cumulative false positive rate).
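Since KS is the maximum gap between the cumulative TPR and FPR, it can be read straight off the ROC curve arrays; a small sketch using the same y_true and y_pred_pos as above:

import numpy as np
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_pred_pos)
ks_stat = np.max(tpr - fpr)                        # maximum separation between the two cumulative curves
ks_threshold = thresholds[np.argmax(tpr - fpr)]    # score cut-off at which it occurs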

Usage Comparison:

8. ROC AUC vs Accuracy: If both the positive and the negative class matter for our predictions, accuracy is the more natural choice; we should balance the data (or account for imbalance) to get a fair accuracy estimate. If we mainly want to rank observations by score, ROC AUC is the better measure.

9. F1 Score vs Accuracy: A similar argument to the previous one: if the data is balanced and we care about both 1's and 0's, accuracy makes sense. The F1 score, being a combination of precision and recall, makes more sense for imbalanced problems.

10. ROC AUC vs PR AUC: If you are mainly interested in the positive class, PR AUC is the metric to use; ROC AUC weighs performance on both the positive and the negative class.

11. F1 score vs ROC AUC: For a highly imbalanced class distribution, the F1 score gives a good picture of the positive class, since both precision and recall focus on positive cases. If the data is less imbalanced and ranking observations is the important outcome, ROC AUC is the better pick.

12. KS Statistic: In marketing segmentation, when we need to pick the top percentage of the population most likely to be 1, we use the KS statistic.

Conclusion:

Picking the right metric is always something of an art. A research scientist is expected to try different metrics based on the modelling goal. We design our modelling objective around the chosen metrics and optimize against them to pick the model/hypothesis that explains the data best.
