F1 Score

In order to effectively train a machine learning model, you need to periodically assess its performance. You need to verify both its output and its capacity for generalization. Otherwise, you simply cannot trust that it does what it’s supposed to do.

Evaluation metrics play a crucial role in these assessments. They allow you to assign a tangible score to how well your model performs, which you can then use to identify where you should focus further training. Accuracy is the most common of these metrics.

Ironically, it is also often the least informative measure of performance when training a classification model, particularly when classes are imbalanced. Precision and recall are a far better option: they break the model’s performance down by class, making it significantly easier to identify potential problems with the training data. Better still, they can be combined to measure a model’s per-class performance, producing a metric known as the F1 score.

What is an F1 Score in Machine Learning?

An F1 score is a metric for evaluating a machine learning model that combines precision and recall to give a more focused picture of that model’s predictive performance. You might think of it as a more fine-tuned version of accuracy. While accuracy evaluates the model as a whole, the F1 score assesses each class on an individual basis.

It does this by combining two class-specific metrics: precision and recall. Before we go further, let’s quickly review them:

  • Precision measures what proportion of the model’s predictions for a class are actually correct.
  • Recall measures what proportion of a class’s actual members the model successfully identifies.

When calculating precision and recall, one typically assigns a model’s predictions into one of four categories:

  • True Positives (TP), which indicate that the model correctly identified a data point as part of the target class.
  • True Negatives (TN), which indicate the model correctly identified a data point as part of a class other than the target class.
  • False Positives (FP), which indicate that the model incorrectly identified a data point as part of the target class.
  • False Negatives (FN), which indicate that the model incorrectly identified a data point as not belonging to the target class.
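
To make these categories concrete, here’s a minimal sketch in Python that tallies TP, TN, FP, and FN for a single target class; the labels it uses are hypothetical and purely illustrative:

  # Tally TP, TN, FP, and FN for one target class.
  # The labels below are hypothetical and used only for illustration.
  y_true = ["horror", "comedy", "horror", "drama", "horror", "comedy"]
  y_pred = ["horror", "horror", "horror", "drama", "comedy", "comedy"]
  target = "horror"

  tp = sum(1 for t, p in zip(y_true, y_pred) if t == target and p == target)
  tn = sum(1 for t, p in zip(y_true, y_pred) if t != target and p != target)
  fp = sum(1 for t, p in zip(y_true, y_pred) if t != target and p == target)
  fn = sum(1 for t, p in zip(y_true, y_pred) if t == target and p != target)

  print(tp, tn, fp, fn)  # 2 2 1 1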

The formulae for precision and recall are as follows:

  • Precision: TP/(TP+FP)
  • Recall: TP/(TP+FN)
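
Continuing the sketch above with the same hypothetical counts, turning them into precision and recall is straightforward:

  def precision(tp, fp):
      # Precision = TP / (TP + FP): how many predicted positives were correct.
      return tp / (tp + fp)

  def recall(tp, fn):
      # Recall = TP / (TP + FN): how many actual positives were found.
      return tp / (tp + fn)

  # Using the hypothetical counts from the previous sketch (TP=2, FP=1, FN=1):
  print(precision(2, 1))  # 0.666...
  print(recall(2, 1))     # 0.666...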

It might help to consider all the metrics above in the context of a simple analogy.

Imagine that, as part of a survey, someone is asked to list every horror movie they’ve watched in the past year. They list ten movies, but only eight of them are horror films. In this example, we can assume that the moviegoer’s recall is 100% — they successfully remembered every horror movie they viewed. However, they also listed two films that didn’t belong to the horror genre, meaning their precision is only 80%.

Although they appear closely related to one another, precision and recall are often opposing metrics in that a model may sacrifice recall to improve precision and vice-versa. The trick lies in finding the balance between the two metrics. That balance is precisely what a class’s F1 score represents.

F1 Score Formula

As mentioned, we need to combine precision and recall to get an F1 score. The calculation for this is as follows:

F1 = 2 * [(Precision * Recall)/(Precision + Recall)]

If, for some reason, you want to calculate a model’s F1 score without first calculating precision and recall, you can instead use the following formula:

F1 Score = 100 * TP/[TP + 0.5 * (FP + FN)]
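
As a quick sanity check, here’s a small sketch (with hypothetical counts) showing that the two formulas agree when precision and recall come from the same confusion-matrix counts:

  tp, fp, fn = 80, 20, 10  # hypothetical counts

  precision = tp / (tp + fp)  # 0.8
  recall = tp / (tp + fn)     # ~0.889

  f1_from_pr = 2 * (precision * recall) / (precision + recall)
  f1_from_counts = tp / (tp + 0.5 * (fp + fn))

  print(round(f1_from_pr, 4), round(f1_from_counts, 4))  # both 0.8421 (84.21%)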

The F1 score is what’s known as a harmonic mean of precision and recall, and can be expressed as a percentage. Whereas an arithmetic mean is simply the average of the numbers in a series, the harmonic mean is pulled toward the smallest of them: it can only be high when all of its constituent values are high, and it equals the arithmetic mean only when those values are identical.

In practice, this means a model’s F1 score will only be high when both its precision and recall are high; if either value drops, the score drops with it. For instance, let’s say you want to calculate a model’s performance on two classes with the following precision and recall values:

  • Class A has a precision of 74% and a recall of 72%.
  • Class B has a precision of 93% and a recall of 80%.

F1 (Class A) = 2 * [(74 * 72)/(74 + 72)] ≈ 72.99%

F1 (Class B) = 2 * [(93 * 80)/(93 + 80)] ≈ 86.01%
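
Notice that Class B’s score sits just below the arithmetic mean of its precision and recall (86.5%), reflecting the harmonic mean’s pull toward the lower value. If you want to verify these numbers yourself, a quick sketch using the same hypothetical precision and recall values looks like this:

  def f1(precision_pct, recall_pct):
      # F1 as a percentage, given precision and recall as percentages.
      return 2 * (precision_pct * recall_pct) / (precision_pct + recall_pct)

  print(round(f1(74, 72), 2))  # 72.99 (Class A)
  print(round(f1(93, 80), 2))  # 86.01 (Class B)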

It’s important to note that, like any metric, an F1 score should not be looked at in a vacuum. It’s only one of several metrics used to evaluate a model’s performance. You should always contextualize it with other metrics.

You should also be cognizant of its shortcomings:

  • Being a single summary value, it offers no insight into error distribution.
  • It assumes that precision and recall carry equal weight, which won’t necessarily be appropriate for every classification problem.
  • It is asymmetric, meaning that swapping which class is treated as the positive class can change the score significantly.
  • Because it disregards true negatives, it tends to be misleading when working with imbalanced classes.
  • It’s primarily intended to measure the performance of binary classification models. Extending it to multi-class data sets requires that you use either a macro- or micro-averaged F1 score, both of which can be somewhat tedious to calculate by hand (see the sketch after this list).
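
If you have scikit-learn available, it can handle both averaging strategies for you. The sketch below is a minimal illustration using hypothetical labels: macro averaging computes an F1 score per class and takes their unweighted mean, while micro averaging pools true positives, false positives, and false negatives across all classes before computing a single score.

  # A minimal sketch of multi-class F1 averaging, assuming scikit-learn is installed.
  from sklearn.metrics import f1_score

  # Hypothetical labels for a three-class problem, used only for illustration.
  y_true = [0, 1, 2, 0, 1, 2, 0, 2]
  y_pred = [0, 2, 1, 0, 1, 2, 1, 2]

  # Macro average: compute F1 for each class, then take the unweighted mean.
  print(f1_score(y_true, y_pred, average="macro"))

  # Micro average: pool TP, FP, and FN across classes, then compute one F1 score.
  print(f1_score(y_true, y_pred, average="micro"))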

What Does a Model’s F1 Score Mean?

Generally speaking, the closer a model’s F1 score is to 100%, the better the model’s overall performance. Conversely, a low F1 score indicates poor performance. This may have various causes, including:

  • An imbalanced data set in which one class is significantly overrepresented compared to other classes.
  • An inadequately sized data set.
  • A data set that doesn’t contain enough representative samples for each class.
  • A model that’s poorly suited for the current classification problem.
  • Poor feature selection.

You can achieve a high F1 score by:

  • Using high-quality training data. This could require you to use data augmentation or synthetic data to fill gaps in your data set or even generate a data set from scratch.
  • Ensuring that you choose the right model architecture for your classification problem.
  • Tuning the model’s hyperparameters.
  • Choosing/creating the right features during feature engineering.