#### Macro F1 and macro F1

Earlier this year we got slightly puzzled about how to best calculate the “macro F1” score to measure the performance of a classifier.

To provide a bit of background, the macro F1 metric is frequently used when classes are considered equally important regardless of their relative frequency. For instance, consider the case of discriminating gemstones found in the dirt. Naturally, only a few of them will be rubies or emeralds. A high accuracy of our classifier does not imply that we’re good at discriminating gemstones in general. Maybe we’re just good at discriminating the frequent quartz varieties, such as amethyst or onyx. Therefore, we sometimes need a metric that puts equal weight on all classes. This is where the macro F1 metric comes into play: it aims to treat all classes as equally important.

Now, let’s go back to the beginning of this year. We found a table in a paper where the numbers just didn’t add up.
There were 18 classes with 18 individual F1 scores, but the “macro F1” was **significantly higher**
than their arithmetic mean. The authors simply used a different formula to calculate macro F1. More specifically,
they computed the harmonic mean (F1 score) of the arithmetic means of all precision and recall scores, as opposed to
computing the arithmetic mean over class-wise F1 scores.

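Here’s the arithmetic mean over harmonic means (AF), where $P_i$ and $R_i$ denote the precision and recall of class $i$ and $n$ is the number of classes:

$$\mathrm{AF} = \frac{1}{n} \sum_{i=1}^{n} \frac{2 P_i R_i}{P_i + R_i}$$

And here’s the harmonic mean over arithmetic means (FA):

$$\mathrm{FA} = \frac{2 \bar{P} \bar{R}}{\bar{P} + \bar{R}}, \qquad \bar{P} = \frac{1}{n} \sum_{i=1}^{n} P_i, \quad \bar{R} = \frac{1}{n} \sum_{i=1}^{n} R_i$$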
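
In code, the two definitions might be sketched as follows. This is a minimal illustration, not a reference implementation: the function names and the per-class scores are made up for the example, and returning 0 when precision and recall are both 0 is an assumed convention.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall.

    Returning 0.0 when both inputs are 0 is an assumed convention.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1_af(precisions, recalls):
    """AF: arithmetic mean of the class-wise F1 scores."""
    scores = [f1(p, r) for p, r in zip(precisions, recalls)]
    return sum(scores) / len(scores)


def macro_f1_fa(precisions, recalls):
    """FA: F1 of the arithmetic means of precision and recall."""
    mean_precision = sum(precisions) / len(precisions)
    mean_recall = sum(recalls) / len(recalls)
    return f1(mean_precision, mean_recall)


# Hypothetical per-class scores for a two-class problem.
precisions = [0.9, 0.2]
recalls = [0.3, 0.8]

print(macro_f1_af(precisions, recalls))  # approximately 0.385
print(macro_f1_fa(precisions, recalls))  # approximately 0.55
```

Both classes have mediocre F1 scores (0.45 and 0.32), yet averaging precision and recall first hides the imbalance between them, which is why FA comes out higher.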

We looked into other papers and found that both formulas are in use.^{1}
Sometimes a paper states which formula is used and sometimes it does not (which may really be nobody’s fault, since
people may not be aware that there are two different formulas).

Historically, macro F1 is not the only measure that has caused confusion. Consider the inch… there used to be
different standards in the 19th century, and this led to some major trouble. Imagine an English person ordering
tiles from Amsterdam just to find out later that they are all too short to be of use! The Romans already did that better!

So we dug a bit deeper to analyze the exact difference between the two
scores and the implications for classifier evaluation. For example, you might ask: perhaps the two computations are
*almost* equivalent? Or, at least, when one formula ranks classifier A higher than B, so does the other?
Different inch standards might lead to different measurements, but they always yield the same ranking when applied to
different objects. Sounds trivial, but that’s a good thing! If that were the case for our two macro F1 formulas, we
should not worry (too) much. Unfortunately, the two macro F1 formulas

- Can differ greatly
- Do **not** necessarily produce the same classifier ranking.

#### So… what metric should we use?

A recent blog post also noted that
there are two macro F1 scores, yet it stops short of a deeper analysis and does not answer the question of which one
to use. Our analysis shows:^{2}

- FA is always greater than or equal to AF.
- They are equal only in the rare circumstance that precision = recall for every class.
- The difference can be as high as 0.5 (50 percentage points).^{3}
- In average cases, the difference can be as high as 0.02 (2 percentage points).^{4}
- FA rewards classifiers which produce skewed error type distributions.
- This is very likely to happen on imbalanced data sets, but small differences are possible even on balanced data sets.
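
The first three claims can be spot-checked empirically. The sketch below is not a proof: it draws random confusion matrices (the seed, matrix sizes and count ranges are arbitrary assumptions), counts cases where FA falls below AF, and tracks the largest gap seen. It also checks a symmetric matrix, where precision equals recall for every class.

```python
import random


def af_and_fa(matrix):
    """Return (AF, FA); matrix[i][j] = items predicted i with gold label j.

    Assumes every diagonal entry is positive, so no division by zero occurs.
    """
    n = len(matrix)
    precisions, recalls = [], []
    for k in range(n):
        true_pos = matrix[k][k]
        precisions.append(true_pos / sum(matrix[k]))              # row sum
        recalls.append(true_pos / sum(row[k] for row in matrix))  # column sum
    f1_scores = [2 * p * r / (p + r) for p, r in zip(precisions, recalls)]
    mean_p = sum(precisions) / n
    mean_r = sum(recalls) / n
    return sum(f1_scores) / n, 2 * mean_p * mean_r / (mean_p + mean_r)


rng = random.Random(0)  # fixed seed so the run is repeatable
violations = 0
max_gap = 0.0
for _ in range(10_000):
    n = rng.randint(2, 5)
    matrix = [[rng.randint(1, 100) for _ in range(n)] for _ in range(n)]
    af, fa = af_and_fa(matrix)
    if fa < af - 1e-12:  # small tolerance for floating-point noise
        violations += 1
    max_gap = max(max_gap, fa - af)

print("violations of FA >= AF:", violations)
print("largest gap observed:", max_gap)

# Symmetric matrix: row sums equal column sums, so precision == recall.
af_sym, fa_sym = af_and_fa([[50, 10], [10, 30]])
print("symmetric case:", af_sym, fa_sym)
```

Uniformly random matrices like these rarely produce extreme gaps, so the largest gap observed this way says little about the worst case.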

Below are a few examples to illustrate the point. In all tables, entry ij is the number of data points labelled i by the classifier that have gold label j (rows are predictions, columns are gold labels).

Table 1: When classes have high recall and low precision, or low recall and high precision, the individual F1 score for every class will be low. Hence, AF (their mean) will also be low. FA, however, might be quite high.

| | A | B |
|---|---|---|
| A | 100 | 10,000 |
| B | 0 | 100 |

Tables 2 and 3: Introducing a bias towards class B improves one metric (FA) and impairs the other (AF), resulting in different classifier rankings. Note that the data set is balanced.

Table 2:

| | A | B | C |
|---|---|---|---|
| A | 3,500 | 2,500 | 1,500 |
| B | 5,000 | 5,000 | 5,000 |
| C | 1,500 | 2,500 | 3,500 |

Table 3:

| | A | B | C |
|---|---|---|---|
| A | 2,000 | 1,000 | 0 |
| B | 8,000 | 8,000 | 8,000 |
| C | 0 | 1,000 | 2,000 |
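
To make the three examples concrete, here is a short sketch that computes both scores from the confusion matrices above, reading rows as predictions and columns as gold labels. The function name is our own choice for illustration.

```python
def macro_f1_af_fa(matrix):
    """Return (AF, FA) for matrix[i][j] = number of items labelled i by the
    classifier whose gold label is j.

    Assumes every class has at least one true positive (true for all three
    tables), so no division by zero occurs.
    """
    n = len(matrix)
    precisions, recalls, f1_scores = [], [], []
    for k in range(n):
        true_pos = matrix[k][k]
        precision = true_pos / sum(matrix[k])              # row k: predicted as k
        recall = true_pos / sum(row[k] for row in matrix)  # column k: gold label k
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(2 * precision * recall / (precision + recall))
    af = sum(f1_scores) / n
    mean_p = sum(precisions) / n
    mean_r = sum(recalls) / n
    fa = 2 * mean_p * mean_r / (mean_p + mean_r)
    return af, fa


tables = {
    "Table 1": [[100, 10_000], [0, 100]],
    "Table 2": [[3_500, 2_500, 1_500], [5_000, 5_000, 5_000], [1_500, 2_500, 3_500]],
    "Table 3": [[2_000, 1_000, 0], [8_000, 8_000, 8_000], [0, 1_000, 2_000]],
}

for name, matrix in tables.items():
    af, fa = macro_f1_af_fa(matrix)
    print(f"{name}: AF = {af:.4f}, FA = {fa:.4f}")
```

For Table 1 this gives AF ≈ 0.02 and FA ≈ 0.50. For Table 2 it gives AF = 0.40 and FA ≈ 0.41, and for Table 3 AF ≈ 0.36 and FA ≈ 0.47: AF prefers the classifier in Table 2, while FA prefers the one in Table 3.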

So let us pose some final questions: Does it make sense to use two metrics interchangeably when they produce
different rankings? And is it justifiable that a macro F1 metric yields a high score, even though class-wise F1
scores are low? We don’t think so and suggest that **AF (the arithmetic mean of class-wise F1 scores) is
to be preferred when evaluating classifiers**. However, if there is one conclusion to draw from all this
it is the following: let’s always state the formula we are using. This prevents any confusion and ensures proper
comparison between classifiers.

1. Here are some examples where we know what is being used: Rudinger et al. (2018), Santos et al. (2011), and Opitz and Frank (2019) compute the harmonic mean over arithmetic means. This is what is recommended in this blog post. And here are some that calculate the arithmetic mean over class-wise harmonic means: Wu and Zhou (2017), Lipton et al. (2014), Rosenthal et al. (2015). This is what’s implemented in scikit-learn.
2. Opitz and Burst (2019), Macro F1 and Macro F1
3. The proof is fairly elaborate, since the well-known inequality for harmonic and arithmetic means does not hold for composite functions. If there is a shorter proof, please let us know! (opitz or burst at cl.uni-heidelberg.de) See also mathoverflow, math.stackexchange.
4. See “Numerical analysis” in Opitz and Burst (2019), Macro F1 and Macro F1 for a detailed analysis of average cases.