Precision and Recall

When measuring the effectiveness of a machine-learning model, accuracy alone is insufficient, no matter how sophisticated the model is. While a high level of accuracy generally correlates with a well-trained classification algorithm, it doesn't measure how well the model performs on specific classes, nor does it account for the class imbalances present in most real-world data sets.

For instance, imagine a deep learning model intended to identify the presence or absence of pancreatic cancer in scans from medical imaging devices. On paper, that model might have an accuracy of 95%. In practice, however, it may earn most of that score by correctly labeling the many healthy scans while frequently misdiagnosing or overlooking the comparatively rare scans that show evidence of cancer.
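
To see how this plays out numerically, here is a minimal sketch of the problem (assuming Python with scikit-learn; the labels are invented for illustration). A degenerate "model" that calls every scan healthy still scores 95% accuracy on an imbalanced test set:

    from sklearn.metrics import accuracy_score, recall_score

    # Hypothetical test set: 95 healthy scans (0) and 5 cancerous scans (1)
    y_true = [0] * 95 + [1] * 5

    # A degenerate "model" that predicts "healthy" for every scan
    y_pred = [0] * 100

    print(accuracy_score(y_true, y_pred))  # 0.95 -- looks impressive
    print(recall_score(y_true, y_pred))    # 0.0  -- misses every cancer case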

A confusion matrix helps address this issue, providing researchers with a visual representation of how well a model predicted samples belonging to each class. This lets them see where their model struggles and optimize their training data accordingly. Moreover, it makes it far easier to calculate two critical assessment metrics that provide deeper insight into the model's performance: precision and recall.
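
As a minimal sketch of how such a matrix is produced in practice (assuming Python with scikit-learn; the label lists below are invented):

    from sklearn.metrics import confusion_matrix

    # Invented labels for illustration: 1 = cancer present, 0 = absent
    y_true = [1, 0, 1, 1, 0, 0, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

    # In scikit-learn's convention, rows are true classes, columns are predictions
    print(confusion_matrix(y_true, y_pred))
    # [[3 1]
    #  [1 3]]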

What is Precision?

Represented as a percentage, precision is a more focused variant of the accuracy metric. Rather than assessing the number of correct predictions across an entire data set, it examines the correct predictions made for a single class. For instance, if a model classified 100 samples as belonging to a particular class but only 80 of them actually belonged to that class, the model's precision score is 80%.

Precision can be applied to both binary-class data sets and multi-class data sets. In the case of the latter, a different precision score can be calculated for each classification.
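
The 80% example above can be reproduced in a few lines of code (a sketch assuming Python with scikit-learn; the counts mirror the hypothetical numbers in the text):

    from sklearn.metrics import precision_score

    # The model flags 100 samples as positive, but only 80 truly are
    y_pred = [1] * 100
    y_true = [1] * 80 + [0] * 20

    print(precision_score(y_true, y_pred))  # 0.8, i.e. 80% precision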

What is Recall?

Recall, also known as the True Positive Rate or sensitivity, measures how completely a model identifies the samples belonging to a particular class. It weighs the number of correct predictions for a class against the total number of samples actually belonging to that class. For example, if a test data set contained 40 samples belonging to class A and the model correctly identified 30 of them, the model's recall for that class is 75%.

As with precision, recall can be applied to both binary- and multi-class data sets, with a different recall score for each class.
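
Likewise, the 75% recall example can be sketched in code (again assuming scikit-learn; the labels are invented to match the numbers above):

    from sklearn.metrics import recall_score

    # All 40 samples belong to class A, but the model only finds 30 of them
    y_true = [1] * 40
    y_pred = [1] * 30 + [0] * 10

    print(recall_score(y_true, y_pred))  # 0.75, i.e. 75% recall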

What’s the Difference Between Precision and Recall?

Because precision and recall are so closely related, it’s easy to conflate them with one another. Before learning how to use precision and recall in machine learning, it’s important to understand how to differentiate the two. This concept can be explained through a simple analogy.

When discussing the risks of password reuse, an employee is asked to recall every account of theirs that doesn't have a unique password. The employee lists twenty accounts, eight of which actually have unique passwords. Assuming the employee has exactly twelve accounts with shared credentials, their recall is 100%: they identified all twelve.

At the same time, the employee also wrongly listed eight accounts as using shared credentials, meaning only twelve of their twenty answers were correct. As a result, their precision score is only 60%.
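
In code, the analogy boils down to two ratios (plain Python; the counts come straight from the scenario above):

    answers_given = 20    # accounts the employee listed
    correct_answers = 12  # of those, accounts that truly share a password
    total_shared = 12     # all accounts that actually share a password

    precision = correct_answers / answers_given  # 12 / 20 = 0.6
    recall = correct_answers / total_shared      # 12 / 12 = 1.0

    print(f"precision = {precision:.0%}, recall = {recall:.0%}")  # 60%, 100%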

Precision and Recall Formula

Before we discuss the formulae for calculating precision and recall, some context is necessary. When assessing binary-class data sets, it's usually easiest to divide the samples into two categories: positives and negatives. From there, each prediction made by the model can be assigned to one of four categories:

  • A True Positive (TP) occurs when a model correctly identifies a sample as belonging to a positive class.
  • A True Negative (TN) occurs when a model correctly identifies a sample as belonging to a negative class.
  • A False Positive (FP) occurs when a model incorrectly identifies a sample as belonging to a positive class.
  • A False Negative (FN) occurs when a model incorrectly identifies a sample as belonging to a negative class.

These categories can then be used to calculate precision and recall as follows:

  • Precision: TP/(TP+FP)
  • Recall: TP/(TP+FN)
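
As a minimal sketch, all four counts and both formulas can be computed directly from a pair of label lists (plain Python; the labels are invented for illustration):

    # Invented binary labels: 1 = positive class, 0 = negative class
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 3
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1

    precision = tp / (tp + fp)  # 3 / 4 = 0.75
    recall = tp / (tp + fn)     # 3 / 4 = 0.75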

When applied to multi-class data sets, the formulae remain essentially the same; each class is treated, in turn, as the positive class:

  • Precision: Class A True Positives/(Class A True Positives + Class A False Positives)
  • Recall: Class A True Positives/(Class A True Positives + Class A False Negatives)

For further context, a false negative in a multi-class data set occurs when a model incorrectly labels a sample from the target class as belonging to a different class. A false positive occurs when a model incorrectly assigns a sample from another class to the target class. As stated earlier, precision and recall may be independently calculated for each class in a multi-class data set.

Let’s say, for instance, that we’ve developed a simple computer vision algorithm to determine the dominant color in a photograph from the primary colors of red, blue, and yellow. We then feed the following data set into the algorithm:

Dominant Color    Photographs
Red               20
Blue              10
Yellow            10

We then compile the results into a confusion matrix, where each row shows what the model predicted and each column shows the actual dominant color:

                      Actual Red    Actual Blue    Actual Yellow
Predicted Red             20             2               5
Predicted Blue             0             7               2
Predicted Yellow           0             1               3

Based on the data above, the model's overall accuracy is 75%: it identified 30 of the 40 samples (20 + 7 + 3) correctly. Digging into that data, let's calculate the precision and recall for photos in which red is the dominant primary color:

Precision (Red): 20 / (20 + 7) ≈ 74%.

Recall (Red): 20 / (20 + 0) = 100% (note that perfect recall is exceedingly rare in the real world).

We can perform the same calculations for the other colors as well:

Precision (Blue): 7 / (7 + 2) ≈ 77.8%.

Recall (Blue): 7 / (7 + 3) = 70%.

Precision (Yellow): 3 / (3 + 1) = 75%.

Recall (Yellow): 3 / (3 + 7) = 30%.

We can average the per-class values above to determine the model's overall (macro-averaged) precision and recall. This allows us to determine that the model has an accuracy of 75%, an average precision of 75.6%, and an average recall of 66.67%. The calculations and the confusion matrix above also tell us that:

  • The model has perfect recall when identifying the color red, but it appears to over-predict red and misidentify photographs of other colors as red as well; this is why its precision for red is the lowest of the three classes.
  • The color blue likely doesn’t require much retraining, as it has the highest precision rate.
  • The model requires far more extensive training in identifying photographs with yellow as the dominant primary color.
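
Finally, the arithmetic in this example can be double-checked with a short script (plain Python; the matrix is copied from the table above, with rows as predictions and columns as actual colors):

    labels = ["red", "blue", "yellow"]
    matrix = [
        [20, 2, 5],  # predicted red
        [0, 7, 2],   # predicted blue
        [0, 1, 3],   # predicted yellow
    ]

    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    print(f"accuracy = {correct / total:.1%}")  # 75.0%

    precisions, recalls = [], []
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(matrix[i])              # row sum: all samples called this color
        actual = sum(row[i] for row in matrix)  # column sum: all samples truly this color
        precisions.append(tp / predicted)
        recalls.append(tp / actual)
        print(f"{label}: precision = {tp / predicted:.1%}, recall = {tp / actual:.1%}")

    print(f"average precision = {sum(precisions) / len(labels):.1%}")  # 75.6%
    print(f"average recall = {sum(recalls) / len(labels):.1%}")        # 66.7%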