LET’S EVALUATE WITH MACRO F1: WHAT CAN GO WRONG?

Juri Opitz, Sebastian Burst - December 5th, 2019

Macro F1 and macro F1

Earlier this year we got slightly puzzled about how to best calculate the “macro F1” score to measure the performance of a classifier.

To provide a bit of background, the macro F1 metric is frequently used when classes are considered equally important regardless of their relative frequency. For instance, consider the case of discriminating gemstones found in the dirt. Naturally, only a few of them will be rubies or emeralds. A high accuracy of our classifier does not imply that we’re good at discriminating gemstones in general; maybe we’re just good at discriminating the frequent quartz varieties, such as amethyst or onyx. Therefore, we sometimes need a metric that puts equal weight on all classes. This is where the macro F1 metric comes into play: it aims to treat all classes as equally important.
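To make this concrete, here is a small, hypothetical toy example (the gemstone counts below are invented for illustration): a lazy classifier that always predicts the majority class reaches high accuracy but a poor macro F1, computed here as the arithmetic mean of class-wise F1 scores (one of the two variants discussed below).

    # Hypothetical, imbalanced gemstone data: mostly quartz, very few rubies.
    gold = ["quartz"] * 95 + ["ruby"] * 5
    pred = ["quartz"] * 100  # a lazy classifier that always says "quartz"

    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

    def f1_for_class(gold, pred, cls):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    classes = sorted(set(gold))
    macro_f1 = sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)

    print(f"accuracy = {accuracy:.2f}")  # 0.95 -- looks great
    print(f"macro F1 = {macro_f1:.2f}")  # 0.49 -- reveals that we never find a ruby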

Now, let’s go back to the beginning of this year. We found a table in a paper where the numbers just didn’t add up: there were 18 classes with 18 individual F1 scores, but the “macro F1” was significantly higher than their arithmetic mean. It turned out that the authors simply used a different formula to calculate macro F1. More specifically, they computed the harmonic mean (an F1 score) of the arithmetic means of the class-wise precision and recall scores, as opposed to the arithmetic mean over the class-wise F1 scores.

With $n$ classes and class-wise precision $P_i$ and recall $R_i$, here’s the arithmetic mean over harmonic means (AF), i.e. the average of the class-wise F1 scores:

$$\text{AF} = \frac{1}{n}\sum_{i=1}^{n} \frac{2\, P_i R_i}{P_i + R_i}$$

And here’s the harmonic mean over arithmetic means (FA), i.e. the F1 score computed from averaged precision and averaged recall:

$$\text{FA} = \frac{2\, \bar{P}\, \bar{R}}{\bar{P} + \bar{R}}, \qquad \text{where } \bar{P} = \frac{1}{n}\sum_{i=1}^{n} P_i \ \text{ and } \ \bar{R} = \frac{1}{n}\sum_{i=1}^{n} R_i.$$
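For readers who prefer code over formulas, here is a minimal sketch of both variants in Python (the function names are ours); it starts from a confusion matrix whose rows are predicted labels and whose columns are gold labels, matching the tables further below.

    def precision_recall(confusion):
        # Class-wise precision and recall from a confusion matrix
        # with rows = predicted labels and columns = gold labels.
        n = len(confusion)
        precisions, recalls = [], []
        for i in range(n):
            tp = confusion[i][i]
            labelled_i = sum(confusion[i])              # everything the classifier labelled i
            truly_i = sum(row[i] for row in confusion)  # everything whose gold label is i
            precisions.append(tp / labelled_i if labelled_i else 0.0)
            recalls.append(tp / truly_i if truly_i else 0.0)
        return precisions, recalls

    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0

    def macro_f1_af(confusion):
        # AF: arithmetic mean over the class-wise F1 scores (harmonic means).
        precisions, recalls = precision_recall(confusion)
        return sum(f1(p, r) for p, r in zip(precisions, recalls)) / len(precisions)

    def macro_f1_fa(confusion):
        # FA: F1 score (harmonic mean) of averaged precision and averaged recall.
        precisions, recalls = precision_recall(confusion)
        return f1(sum(precisions) / len(precisions), sum(recalls) / len(recalls))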

We looked into other papers and found that both formulas are being used.1  Sometimes it is stated which formula is used and sometimes it is not (which may really be nobody’s fault, since people may not be aware that there are two different formulas).

[Figure: A mid-19th-century tool for converting between different standards of the inch. Not only macro F1 but also the inch used to yield different scores for the same input; the different inch standards, however, always produce the same ranking of inputs, which is not the case for macro F1.]

Historically, macro F1 is not the only metric that has caused confusion. Consider the inch: different standards coexisted in the 19th century, and this led to some major trouble. Imagine an English person ordering tiles from Amsterdam, only to find out later that they are all too short to be of use! (The Romans already managed this better.) So we dug a bit deeper to analyze the exact difference between the two scores and the implications for classifier evaluation. One might hope that the two computations are almost equivalent, or at least that whenever one formula ranks classifier A above classifier B, so does the other. Different inch standards may yield different measurements, but they always produce the same ranking when applied to different objects. That sounds trivial, but it is a good thing: if the same held for our two macro F1 formulas, we should not worry (too) much. Unfortunately, the two macro F1 formulas

  1. Can differ greatly
  2. Do not necessarily produce the same classifier ranking.

So… what metric should we use?

A recent blog post also noted that there are two macro F1 scores, yet it stops short of a deeper analysis and avoids answering the question of which one to use. Our analysis shows:2

  1. FA is always greater than or equal to AF.
  2. They are equal only in the rare circumstance that precision equals recall for every class (a short check of this special case follows this list).
  3. The difference can be as high as 0.5 (50 percentage points).3
  4. In average cases, the difference can be as high as 0.02 (2 percentage points).4
  5. FA rewards classifiers which produce skewed error type distributions.
  6. This is very likely to happen on imbalanced data sets, but small differences are possible even on balanced data sets.
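To see why the two scores coincide in the special case of point 2 (this is only a check of the “if” direction; the full argument is part of the analysis in the paper), write $x_i := P_i = R_i$ for every class $i$. Then each class-wise F1 score collapses to its own argument,

$$F1_i = \frac{2\, x_i x_i}{x_i + x_i} = x_i, \qquad \text{so} \qquad \text{AF} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

and since $\bar{P} = \bar{R} = \frac{1}{n}\sum_{i=1}^{n} x_i =: m$, we also get $\text{FA} = \frac{2m \cdot m}{m + m} = m = \text{AF}$.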

Below are a few examples to illustrate the point. In all tables, entry (i, j) is the number of data points that the classifier labelled i and whose gold label is j (rows are predictions, columns are gold labels). Table 1: When every class has either high recall and low precision or low recall and high precision, the individual F1 score of every class will be low, and hence AF (their mean) will also be low. FA, however, might be quite high. Tables 2 and 3: Introducing a bias towards class B improves one metric and impairs the other, resulting in different classifier rankings. Note that the data set is balanced.

Table 1: AF=0.02, FA=0.5
pred \ gold      A        B
A              100   10,000
B                0      100

Table 2: AF=0.4, FA=0.41
pred \ gold      A        B        C
A            3,500    2,500    1,500
B            5,000    5,000    5,000
C            1,500    2,500    3,500

Table 3: AF=0.36, FA=0.47
pred \ gold      A        B        C
A            2,000    1,000        0
B            8,000    8,000    8,000
C                0    1,000    2,000
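As a quick sanity check, the scores above can be reproduced with the macro_f1_af and macro_f1_fa sketches from earlier in this post (the confusion matrices are entered with rows as predicted labels and columns as gold labels):

    tables = {
        "Table 1": [[100, 10_000],
                    [0, 100]],
        "Table 2": [[3_500, 2_500, 1_500],
                    [5_000, 5_000, 5_000],
                    [1_500, 2_500, 3_500]],
        "Table 3": [[2_000, 1_000, 0],
                    [8_000, 8_000, 8_000],
                    [0, 1_000, 2_000]],
    }

    for name, confusion in tables.items():
        print(f"{name}: AF={macro_f1_af(confusion):.2f}, FA={macro_f1_fa(confusion):.2f}")
    # Table 1: AF=0.02, FA=0.50
    # Table 2: AF=0.40, FA=0.41
    # Table 3: AF=0.36, FA=0.47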

So let us pose some final questions: Does it make sense to use two metrics interchangeably when they produce different rankings? And is it justifiable that a macro F1 metric yields a high score even though the class-wise F1 scores are low? We don’t think so, and we suggest that AF (the arithmetic mean of class-wise F1 scores) should be preferred when evaluating classifiers. However, if there is one conclusion to draw from all this, it is the following: let’s always state the formula we are using. This prevents any confusion and ensures proper comparison between classifiers.
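If you evaluate with scikit-learn, note that its macro-averaged f1_score corresponds to AF (see footnote 1); FA can be computed explicitly from macro-averaged precision and recall. A short sketch, with made-up label arrays y_true and y_pred:

    from sklearn.metrics import f1_score, precision_score, recall_score

    # Made-up gold and predicted labels, just for illustration.
    y_true = ["A", "A", "B", "B", "B", "C"]
    y_pred = ["A", "B", "B", "B", "C", "C"]

    # AF: scikit-learn's macro average = arithmetic mean of class-wise F1 scores.
    af = f1_score(y_true, y_pred, average="macro")

    # FA: harmonic mean of macro-averaged precision and macro-averaged recall.
    p_bar = precision_score(y_true, y_pred, average="macro")
    r_bar = recall_score(y_true, y_pred, average="macro")
    fa = 2 * p_bar * r_bar / (p_bar + r_bar)

    print(f"AF = {af:.3f}, FA = {fa:.3f}")  # whichever you report, state the formula!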

  1. Here are some examples where we know what is being used. Rudinger et al. (2018), Santos et al. (2011), and Opitz and Frank (2019) compute the harmonic mean over arithmetic means (FA); this is also what another blog post we came across recommends. And here are some that calculate the arithmetic mean over class-wise harmonic means (AF): Wu and Zhou (2017), Lipton et al. (2014), Rosenthal et al. (2015). This is what is implemented in scikit-learn.
  2. Opitz and Burst (2019), Macro F1 and Macro F1
  3. The proof is fairly elaborate, since the well-known inequality between the harmonic and arithmetic mean does not carry over to such composite functions. If there is a shorter proof, please let us know (opitz or burst at cl.uni-heidelberg.de)! See also the related discussions on mathoverflow and math.stackexchange.
  4. See “Numerical analysis” in Opitz and Burst (2019), Macro F1 and Macro F1 for a detailed analysis of average cases.