
17 December 2023

Scikit-Learn's F-1 calculator is broken

by Connor Boyle

TL;DR: if you are using scikit-learn 1.3.X and call f1_score() or classification_report() with the argument zero_division=1.0 or zero_division=np.nan[1], then there's a chance that the output of that function is wrong (possibly by any amount up to 100%, depending on the number of classes in your dataset). E.g., for zero_division=1.0:

>>> import sklearn.metrics
>>> sklearn.__version__
'1.3.0'
>>> sklearn.metrics.f1_score(y_true=list(range(104)), y_pred=list(range(100)) + [101, 102, 103, 104], average='macro', zero_division=1.0)
0.9809523809523809  # incorrect

Compare to the exact same expression in an earlier version of Scikit-Learn:

>>> import sklearn.metrics
>>> sklearn.__version__
'1.2.2'
>>> sklearn.metrics.f1_score(y_true=list(range(104)), y_pred=list(range(100)) + [101, 102, 103, 104], average='macro', zero_division=1.0)
0.9523809523809523  # correct
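Where do those numbers come from? There are 105 labels in play (0 through 104). The classifier gets classes 0 through 99 exactly right (per-class F-1 of 1.0 each), and every remaining class has a per-class F-1 of 0.0, so the correct macro average is 100 / 105 ≈ 0.952. The bug substitutes 1.0 for the per-class F-1 of the three classes (101, 102, and 103) whose precision and recall are both 0.0, inflating the average to 103 / 105 ≈ 0.981.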

Similar cases for zero_division=np.nan (which was introduced in 1.3.0, so I can’t directly compare to the output in 1.2.2):

>>> import numpy as np
>>> sklearn.metrics.f1_score([0, 1], [1, 0], average='macro', zero_division=np.nan)
nan  # should be 0.0
>>> sklearn.metrics.f1_score([0, 1, 2], [1, 0, 2], average='macro', zero_division=np.nan)
1.0  # should be ~0.33
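To see where the corrected values come from: in the second example, classes 0 and 1 are each classified completely wrongly (precision & recall are both 0.0, so each per-class F-1 should be 0.0), while class 2 is classified perfectly (per-class F-1 of 1.0). Every label appears in both y_true and y_pred, so no genuine division by zero occurs and no class should be excluded from the average; the correct macro F-1 is (0.0 + 0.0 + 1.0) / 3 ≈ 0.33. In the first example, both classes should get a per-class F-1 of 0.0, for a macro average of 0.0 rather than nan.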

The Scikit-Learn maintainers and I both consider the behavior in 1.3.X to be incorrect. While a pull request to fix this behavior was just merged, the fix has not yet shipped in any released version of Scikit-Learn. Therefore, the easiest solution to this specific problem is to revert to Scikit-Learn 1.2.2, or to use zero_division=0.0 if possible, while being careful to understand how this parameter change will affect precision, recall, & F-1 (see below for an explainer on the purpose and function of the zero_division parameter).

(EDIT 2024-01-24: Scikit-Learn 1.4.0 was released a week ago and contains a fix for this bug. Go and update now!)

The problem is that F-1 for an individual class is calculated as 1.0 or np.nan when precision & recall are both 0.0, which is not the desired behavior of the zero_division parameter.

How did this happen?

Let’s take a look at some formulae for classification metrics:

\[\textrm{precision} = \frac{\textrm{true positive}}{\textrm{true positive} + \textrm{false positive}}\]

\[\textrm{recall} = \frac{\textrm{true positive}}{\textrm{true positive} + \textrm{false negative}}\]

\[\textrm{F}_1 = \frac{2 \cdot \textrm{precision} \cdot \textrm{recall}}{\textrm{precision} + \textrm{recall}}\]

There are three different places here where a division by zero can occur:

- in precision, when the classifier makes no positive predictions for a class (true positive + false positive = 0);
- in recall, when there are no actual examples of a class (true positive + false negative = 0);
- in F-1, when precision and recall are both 0.0.
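To make these three spots concrete, here's a minimal sketch of the formulas in plain Python (my own illustration of how zero_division is meant to behave, not Scikit-Learn's actual implementation):

def prf1(tp, fp, fn, zero_division=0.0):
    # (1) Precision divides by zero when the classifier makes no positive
    #     predictions for this class; `zero_division` stands in.
    precision = tp / (tp + fp) if tp + fp else zero_division
    # (2) Recall divides by zero when there are no actual examples of this
    #     class; `zero_division` stands in.
    recall = tp / (tp + fn) if tp + fn else zero_division
    # (3) F-1 divides by zero when precision and recall are both 0.0 -- but
    #     unlike (1) and (2), the right answer here is uncontroversial: an
    #     always-wrong classifier gets an F-1 of 0.0.
    f1 = 0.0 if precision + recall == 0 else (
        2 * precision * recall / (precision + recall))
    return precision, recall, f1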

The first two are interesting cases where reasonable people could disagree on what the correct behavior should be: if a classifier never predicts a given class, is its precision for that class 0.0, 1.0, or simply undefined? The zero_division parameter exists to let the user decide: its value (0.0, 1.0, or, as of 1.3.0, np.nan) is substituted for precision or recall whenever their denominator is zero.

For F-1, however, the “division by zero” case is not interesting or controversial in any way. If a classifier has achieved a recall of 0.0 (it missed every actual positive) and a precision of 0.0 (all of its positive predictions are false), I don’t think any reasonable person would disagree about what the F-1 score should be: 0.0. Indeed, this is exactly how Scikit-Learn calculated F-1 right up to (and including) version 1.2.2, regardless of the value of the zero_division parameter.

However, in Scikit-Learn 1.3.0, the zero_division parameter was turned into a kind of monkey’s paw that defines the behavior of any division by zero that happens to occur during the calculation of an F-1 score, leading to the bizarre scenario where a 100% wrong classifier can get an F-1 score of 100%:[2]

>>> sklearn.__version__
'1.3.0'
>>> print(sklearn.metrics.classification_report(y_true=[0, 1, 2, 3, 4], y_pred=[1, 2, 3, 4, 0], zero_division=1.0))
              precision    recall  f1-score   support

           0       0.00      0.00      1.00       1.0
           1       0.00      0.00      1.00       1.0
           2       0.00      0.00      1.00       1.0
           3       0.00      0.00      1.00       1.0
           4       0.00      0.00      1.00       1.0

    accuracy                           1.00       5.0
   macro avg       0.00      0.00      1.00       5.0
weighted avg       0.00      0.00      1.00       5.0

Why? Because precision and recall are both 0, the denominator of the F-1 formula is 0, and zero_division=1.0 now (as of Scikit-Learn 1.3.0) applies to the F-1 calculation itself, so F-1 is (incorrectly) reported as 1.0!
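As a sanity check, here's the same example recomputed by hand, tallying the confusion counts directly from the two label lists:

y_true = [0, 1, 2, 3, 4]
y_pred = [1, 2, 3, 4, 0]

for cls in range(5):
    tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
    fp = sum(p == cls and t != cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    # Each class appears exactly once in y_true and is predicted exactly
    # once, so neither denominator is zero and zero_division never comes
    # into play for precision or recall:
    precision = tp / (tp + fp)  # 0 / 1 == 0.0 for every class
    recall = tp / (tp + fn)     # 0 / 1 == 0.0 for every class
    # Both are genuinely 0.0, so the only sensible per-class F-1 is 0.0:
    f1 = 0.0 if precision + recall == 0 else (
        2 * precision * recall / (precision + recall))
    print(f"class {cls}: precision={precision}, recall={recall}, f1={f1}")

Every per-class F-1 comes out to 0.0, so the macro average (and therefore the report above) should read 0.0, not 1.00.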

Why does this matter?

I don’t know if there are rigorous statistics on this, but I’d wager that macro average F-1 is the most commonly used metric for multiclass classification by a wide margin. Scikit-Learn’s f1_score() function is in turn very likely the most commonly used implementation of F-1. Try asking Google or ChatGPT how to calculate F-1; the first results will very likely tell you to use this exact function in Scikit-Learn.

The kinds of tasks F-1 could be used for range from low-risk, like sentiment analysis on customer reviews, to conceivably safety-critical systems. Imagine a researcher at an autonomous car company who thinks their computer vision system is performing really well at recognizing all categories of objects & entities on the road, when in fact their classifier is completely missing every single example of a few classes!

Ideally, a machine learning practitioner should notice this bug well before a classifier is put into production or its results are reported in a submitted journal paper. On the other hand, you really would not expect the definition of F-1 to change from one version of Scikit-Learn to the next! While just about any programmer could implement an F-1 calculator in very little time, most of us prefer to import Scikit-Learn’s precisely to avoid gotcha edge cases like this one.

What should I do now?

If your project:

- uses a 1.3.X version of Scikit-Learn, and
- calls f1_score() or classification_report() with zero_division=1.0 or zero_division=np.nan,

it may have been affected by this bug. To determine whether any particular F-1 score calculation was impacted, first change that calculation to a classification_report() if possible. If any class in the resulting report shows a precision of 0.0, a recall of 0.0, and an f1-score of 1.0 or nan, then the F-1 score for this classifier has been calculated incorrectly.
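If you can rerun the evaluation, that check can be automated by asking classification_report() for a dictionary instead of a string. A sketch (y_true, y_pred, and the zero_division value here are stand-ins for your project's own):

import math

from sklearn.metrics import classification_report

y_true = [0, 1, 2, 3, 4]  # stand-in: use your real labels
y_pred = [1, 2, 3, 4, 0]  # stand-in: use your real predictions

report = classification_report(y_true, y_pred, zero_division=1.0,
                               output_dict=True)
for label, stats in report.items():
    if not isinstance(stats, dict):
        continue  # skip the scalar "accuracy" entry
    p, r, f1 = stats["precision"], stats["recall"], stats["f1-score"]
    if p == 0.0 and r == 0.0 and (f1 == 1.0 or math.isnan(f1)):
        print(f"{label}: precision=0, recall=0, but f1={f1} -- affected")

In this stand-in example the loop flags all five classes (and the averaged rows, which inherit the same pattern).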

Any call using zero_division=1.0 can be fixed by reverting to Scikit-Learn version 1.2.2. Unfortunately, the option zero_division=np.nan did not exist in Scikit-Learn 1.2.2, and I don’t believe there is any easy way to replicate it there.
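If you really need the np.nan behavior while staying on 1.2.2, the closest workaround I can suggest is to skip f1_score() entirely and compute the macro average by hand. A minimal sketch, under the assumption that the intended semantics are: a class that appears in neither y_true nor y_pred contributes nan and is excluded from the average, while a class whose precision and recall are genuinely 0.0 contributes 0.0:

import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

def macro_f1_nan(y_true, y_pred, labels=None):
    # One 2x2 matrix per class, laid out as [[tn, fp], [fn, tp]]:
    mcm = multilabel_confusion_matrix(y_true, y_pred, labels=labels)
    per_class = []
    for (tn, fp), (fn, tp) in mcm:
        if tp + fp + fn == 0:
            # The class never occurs in y_true or y_pred: a genuine
            # division by zero, so exclude it from the average.
            per_class.append(np.nan)
        else:
            # 2*tp / (2*tp + fp + fn) is algebraically identical to
            # 2*P*R / (P + R), and is 0.0 (not undefined) when tp == 0.
            per_class.append(2 * tp / (2 * tp + fp + fn))
    return float(np.nanmean(per_class))

As a check against the examples above, macro_f1_nan([0, 1], [1, 0]) returns 0.0 and macro_f1_nan([0, 1, 2], [1, 0, 2]) returns ~0.33.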

(EDIT 2024-01-24: Scikit-Learn 1.4.0 has been released, and you should update to it ASAP!)


Footnotes:

  1. In this post, np.nan refers to numpy.nan 

  2. A completely wrong classifier can also get an F-1 score of 0.0 in Scikit-Learn 1.3.X, for example:

    >>> print(sklearn.metrics.classification_report(y_true=[0, 0, 0], y_pred=[1, 1, 1], zero_division=1.0))
                  precision    recall  f1-score   support
    
               0       1.00      0.00      0.00       3.0
               1       0.00      1.00      0.00       0.0
    
        accuracy                           1.00       3.0
       macro avg       0.50      0.50      0.00       3.0
    weighted avg       1.00      0.00      0.00       3.0
    

This classifier (correctly) receives an F-1 of 0.0 because, for each class, either precision or recall (but never both) is zero, which means the denominator of each per-class F-1 score is nonzero.
