Machine Learning Classification Metrics
I haven't talked much about machine learning classification metrics aside from the confusion matrix. Let's do a short rehash of that and add some more useful metrics that are derived from the confusion matrix.
Confusion Matrix
Despite its name, it's actually pretty simple to understand. Let's assume we are trying to classify observations into three groups: tall, medium, and short. Let's arbitrarily label the tall group positive and any group that isn't tall negative. This means the medium and short groups will both be labeled negative. True positives (TP) are the observations correctly classified as positive, e.g. tall observations predicted to be tall. True negatives (TN) are the observations correctly rejected, e.g. non-tall observations predicted to be not tall. False positives (FP) are the observations incorrectly classified as positive, e.g. non-tall observations predicted to be tall. False negatives (FN) are the observations incorrectly rejected, e.g. tall observations predicted to be not tall.
| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | TP | FP | 
| Predicted Negative | FN | TN | 
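
To make the four cells concrete, here is a minimal sketch of how they could be counted from a list of actual and predicted labels. The function name `confusion_counts` and the six example observations are made up for illustration; "tall" is treated as positive and everything else as negative, as in the setup above.

```python
def confusion_counts(actual, predicted, positive="tall"):
    """Count TP, FP, FN, TN, treating `positive` as the positive class
    and every other label as negative."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted tall, actually tall
        elif p == positive:
            fp += 1  # predicted tall, actually not tall
        elif a == positive:
            fn += 1  # predicted not tall, actually tall
        else:
            tn += 1  # predicted not tall, actually not tall
    return tp, fp, fn, tn


# Hypothetical labels, for illustration only
actual = ["tall", "tall", "medium", "short", "tall", "medium"]
predicted = ["tall", "medium", "tall", "short", "tall", "medium"]

print(confusion_counts(actual, predicted))  # (2, 1, 1, 2)
```

Note that a medium observation predicted as short still counts as a true negative here, because both labels fall into the "not tall" group.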
Other Metrics
Alright, so we've got the confusion matrix down! It looks pretty helpful, but it would be even more helpful if we could further quantify our model's performance.
| Name | Description | Equation | 
|---|---|---|
| TP | Number of Correct Positive Predictions | NA | 
| TN | Number of Correct Negative Predictions | NA | 
| FP | Number of Incorrect Positive Predictions | NA | 
| FN | Number of Incorrect Negative Predictions | NA | 
| Sensitivity (Recall) | Proportion of Actual Positives Predicted Correctly | $\frac {TP}{TP+FN}$ | 
| Specificity | Proportion of Actual Negatives Predicted Correctly | $\frac {TN}{TN+FP}$ | 
| Accuracy | Proportion of All Observations Predicted Correctly | $\frac {TP + TN}{TP + TN + FP + FN}$ | 
| Balanced Accuracy | Average of Sensitivity and Specificity | $\frac {Sens + Spec}{2}$ | 
| Precision (PPV) | Proportion of Positive Predictions That Are Correct | $\frac {TP}{TP+FP}$ | 
| Negative Predictive Value (NPV) | Proportion of Negative Predictions That Are Correct | $\frac {TN}{TN+FN}$ | 
| F1 | Harmonic Mean of Sensitivity and PPV | $\frac {2 * PPV * Sens}{PPV + Sens}$ | 
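
The equations in the table translate almost directly into code. Below is a rough sketch, not a reference implementation: the function name `classification_metrics` and the example counts are assumptions for illustration, and it assumes every denominator is non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics from the table above from the four confusion
    matrix counts. Assumes every denominator is non-zero."""
    sens = tp / (tp + fn)  # sensitivity (recall)
    spec = tn / (tn + fp)  # specificity
    ppv = tp / (tp + fp)   # precision
    npv = tn / (tn + fn)   # negative predictive value
    return {
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "balanced_accuracy": (sens + spec) / 2,
        "precision": ppv,
        "npv": npv,
        "f1": 2 * ppv * sens / (ppv + sens),
    }


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```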
Sensitivity (Recall)
Sensitivity tells us how well our model did at finding the observations that were actually tall, i.e. of all the truly tall observations, what fraction did it predict as tall. High sensitivity means our model is good at finding observations that are tall and doesn't have many false negatives.
Specificity
Specificity tells us how well our model did at finding the observations that weren't tall, i.e. of all the truly non-tall observations, what fraction did it predict as not tall. High specificity means our model is good at finding observations that aren't tall and doesn't have many false positives.
Accuracy
Accuracy is the number of correct predictions (TP and TN) divided by all the predictions made by the model. This gives a proportion between 0 and 1, usually reported as a percentage. I generally don't use accuracy because it becomes extremely biased when the groups you are trying to predict are unequal in size. For example, if tall has 60 observations, medium has 30 observations, and short has 10 observations, accuracy will be unreliable, because a model can score well just by favoring the larger groups. This is a problem I run into a lot with public clinical neuroimaging datasets.
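
A quick worked example shows how misleading this can get. It uses the 60/30/10 group sizes from above, but with two assumptions made purely for illustration: "short" is treated as the positive class, and a lazy hypothetical model predicts "not short" for every observation.

```python
# Class sizes from the example above: 60 tall, 30 medium, 10 short.
# Assumption for illustration: "short" is the positive class, and a lazy
# hypothetical model predicts "not short" for every single observation.
tp, fn = 0, 10  # all 10 short observations are missed
tn, fp = 90, 0  # all 90 non-short observations are rejected correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)

print(f"accuracy    = {accuracy:.2f}")     # accuracy    = 0.90
print(f"sensitivity = {sensitivity:.2f}")  # sensitivity = 0.00
```

The model scores 90% accuracy without ever finding a single short observation.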
Balanced Accuracy
Balanced Accuracy is not biased by unequal groups like accuracy is. It does this by taking the average of specificity and sensitivity.

$$BalancedAccuracy = \frac {Specificity + Sensitivity}{2}$$
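
Here is a small sketch of the formula in action. The function name `balanced_accuracy` and both sets of counts are hypothetical: an imbalanced dataset with 10 positives and 90 negatives, scored first for a model that always predicts negative and then for a model that actually finds most of the positives.

```python
def balanced_accuracy(tp, fp, fn, tn):
    """Average of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical model that always predicts negative: 90% plain accuracy,
# but no better than a coin flip once both groups are weighted equally.
print(f"{balanced_accuracy(tp=0, fp=0, fn=10, tn=90):.2f}")  # 0.50

# Hypothetical model that finds 8 of the 10 positives and 81 of the 90 negatives.
print(f"{balanced_accuracy(tp=8, fp=9, fn=2, tn=81):.2f}")  # 0.85
```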
Precision (PPV)
Precision tells us how trustworthy our model's positive predictions are, e.g. of the observations predicted to be tall, how many actually are tall. High precision means that when our model says an observation is tall it is usually right, so it doesn't have many false positives.
Negative Predictive Value (NPV)
NPV tells us how trustworthy our model's negative predictions are, e.g. of the observations predicted to be not tall, how many actually aren't tall. High NPV means that when our model says an observation isn't tall it is usually right, so it doesn't have many false negatives.
F1
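F1 is the harmonic mean of precision and sensitivity, so it is only high when both of them are high. A high F1 will mean your model is good at identifying tall observations while not having many false positives or false negatives.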
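
To see why F1 punishes a lopsided model, here is a minimal sketch comparing two made-up precision/recall pairs. The function name `f1_score` is just illustrative, and returning 0 when both inputs are 0 is a convention chosen here to avoid dividing by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision (PPV) and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


# Both high: F1 is high.
print(f"{f1_score(0.9, 0.9):.2f}")  # 0.90

# High precision but poor recall: F1 is dragged toward the weaker value.
print(f"{f1_score(0.9, 0.1):.2f}")  # 0.18
```

A plain average of 0.9 and 0.1 would give 0.5, which hides how badly the model is doing on recall.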
What Metric Matters?
Which metrics you use to judge how good your model is depends on the problem you're trying to solve. For the example in this post, trying to identify tall observations, precision and recall would be good to use, or the combination of both, the F1 metric, which in my opinion is the most useful metric in most cases. High precision would mean our model isn't incorrectly identifying non-tall observations as tall. High recall would mean our model finds most of the tall observations instead of incorrectly labeling them as not tall. So a high F1 will mean your model is good at identifying tall observations while not having many false positives or false negatives. Again, this is very dependent on the problem you're solving, and low values for some of these measures are acceptable in some contexts.