Model Calibration when Subsampling Negatives

Jeshua Bratman
Aug 17, 2019

With class-imbalanced ML problems, it's often convenient to subsample the negative examples to speed up data pipelines or training jobs. For example, one of our problems at Abnormal Security is to classify rare social engineering and phishing email attacks, which occur at a rate of between 0.001% and 0.1% of all message volume. Because of this class imbalance, we want to include every single positive example (attacks) but only a portion of the negative examples (safe emails).

For example:
Real Distribution — 100mil negatives, 10k positives
Training Distribution — 10mil negatives (10% subsample), 10k positives
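
As a rough sketch of what this subsampling step can look like in a training pipeline (the DataFrame, the label column name, and the 10% rate below are placeholders, not our actual pipeline):

```python
import pandas as pd

def subsample_negatives(examples: pd.DataFrame, rate: float, seed: int = 0) -> pd.DataFrame:
    """Keep every positive example and a random fraction `rate` of the negatives."""
    positives = examples[examples["label"] == 1]
    negatives = examples[examples["label"] == 0].sample(frac=rate, random_state=seed)
    # Shuffle so positives and negatives are interleaved for training
    return pd.concat([positives, negatives]).sample(frac=1.0, random_state=seed)

# e.g. train_df = subsample_negatives(all_messages, rate=0.1)
```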

When we train a model on this data we may get good predictive power, but its predictions will not be actual probabilities. Most obviously, the mean is shifted: the model will predict the average class probability to be around 0.001 when in reality it is around 0.0001.

Definition: Calibration — Matching the distribution of a model's predictions with the real distribution: pred(class=1 | example) ≈ prob(class=1 | example)
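
A quick way to check this in practice is a reliability curve: bin the model's predictions and compare the mean prediction in each bin to the observed positive rate. A minimal sketch using scikit-learn's calibration_curve (the toy labels and scores below are stand-ins for a real labeled evaluation set):

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Toy stand-ins for a labeled eval set: 0/1 labels and model scores in [0, 1]
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.01, size=200_000)
y_prob = np.clip(0.01 + 0.6 * y_true + rng.normal(0.0, 0.05, size=200_000), 0.0, 1.0)

# Mean predicted probability vs. observed positive rate, per bin
observed_rate, mean_prediction = calibration_curve(y_true, y_prob, n_bins=10)
for obs, pred in zip(observed_rate, mean_prediction):
    print(f"mean prediction {pred:.3f}   observed positive rate {obs:.3f}")
```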

When is calibration important?

You don't need to calibrate your models if you are only using the relative output of the classifier:

  • Ranking and recommendation problems: No need to calibrate
    — For example, the Twitter timeline is ranked using the relative outputs of classification models; these do not need to be exact probabilities
  • Probabilistic decision making: Needs calibration
    — In online advertisement bidding, you might want to calculate the true expected probability a user will click for cost-per-click advertising
    — In sequential decision-making or probabilistic planning problems, you want to model real transition probabilities to make informed decisions, for example, multi-armed bandits or reinforcement learning problems
  • Classification problems: Needs calibration
    — For example, when classifying whether an email is a phishing attack or whether a video frame contains a stop sign, you may want true probabilities to control precision so that you know how many false positives to expect.

Subsampling, Precision, and Thresholding

When building a classifier, we threshold the output of our model to create a decision boundary, and we often want tight control over the resulting precision.

Precision — Fraction of examples flagged by the classifier that are true positives: prec = TP / (TP + FP)

Thresholding — Selecting a threshold τ and classifying an example as class 1 when pred(example) > τ, otherwise as class 0
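
These two definitions translate directly into code. A small sketch, assuming NumPy arrays of labels and raw model scores:

```python
import numpy as np

def precision_at_threshold(y_true: np.ndarray, scores: np.ndarray, tau: float) -> float:
    """Precision of the rule: predict class 1 when score > tau, else class 0."""
    flagged = scores > tau
    tp = int(np.sum((y_true == 1) & flagged))
    fp = int(np.sum((y_true == 0) & flagged))
    return tp / (tp + fp) if (tp + fp) > 0 else float("nan")
```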

Thresholding with Uncalibrated Classifier on Subsampled Dataset

If we choose a threshold on a subsampled evaluation dataset, the real results will be quite different on the full dataset (and in online performance). Here's a made-up example comparing a classifier evaluated on 1mil subsampled negatives vs the original dataset with 10mil negatives:

In the example above, if we were to choose a threshold of 0.5, we would achieve 96.6% precision on the subsampled data, but only 74.2% precision on the full dataset.

How does precision on the subsample compare to the full dataset?

Let's say our subsample rate is ρ and our precision on the subsampled evaluation data at a given score x is π(x). The precision on the full dataset will be:

precision(x) = ρ·π(x) / (π(x)·(ρ - 1) + 1)

For example, in the graph on the right above, we see that precision(x=0.5) is indeed ≈ 0.74 = 0.1 · 0.9663 / (0.9663 · (0.1 - 1) + 1)
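
Written as a small (hypothetical) helper, with the numbers from the made-up example plugged in:

```python
def full_dataset_precision(pi: float, rho: float) -> float:
    """Map precision measured on a negatively subsampled eval set (pi) to the
    precision expected on the full dataset, given subsample rate rho."""
    return rho * pi / (pi * (rho - 1.0) + 1.0)

print(full_dataset_precision(pi=0.9663, rho=0.1))  # ~0.74, matching the example above
```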

This formula lets us back out the full-dataset precision for a given threshold if we know the subsampling rate. You can build π using isotonic regression on the subsampled data, for example as described here (essentially fitting the curve in the right-side graph above). Once you have a function π calibrated on the subsampled data, you can recover the true probabilities with the same formula:

calibrated(x) = ρ·π(x) / (π(x)·(ρ - 1) + 1)

If all goes well, then calibrated(x) ≈ prob(class=1 | x) in your online application.
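
Putting the two steps together, here is a minimal sketch, assuming scikit-learn, a held-out subsampled evaluation set of raw model scores with labels, and a known subsample rate ρ:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_subsampled_calibrator(scores: np.ndarray, labels: np.ndarray) -> IsotonicRegression:
    """Fit pi: raw model score -> probability on the *subsampled* distribution."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(scores, labels)
    return iso

def calibrated(iso: IsotonicRegression, scores: np.ndarray, rho: float) -> np.ndarray:
    """Correct pi(x) for the subsample rate so the result approximates
    prob(class=1 | x) on the full, unsubsampled distribution."""
    pi = iso.predict(scores)
    return rho * pi / (pi * (rho - 1.0) + 1.0)

# e.g. iso = fit_subsampled_calibrator(eval_scores, eval_labels)
#      probs = calibrated(iso, new_scores, rho=0.1)
```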

Problems you might encounter

In the ideal setting, the above technique works. In reality, however, you will often run into a few problems:

  • Online data might not match your batch data. Even small errors in calibrated probabilities can change the actual number of false positives significantly, especially if you are calibrating toward 99% precision: an error of just 1% will double your false positives.
  • The subsample rate might not be exact because of issues in your data pipeline, missing days of data, or because you want to use positive examples from a different date range than your negative examples in evaluation.

Alternatives

Often it is impossible to completely trust offline calibration. Some options:

  • Use a full dataset for calibration. If you can calibrate on a full, unsampled dataset, you can simply build your calibrated model using isotonic regression. Generating this dataset has its own problems: you must ensure it is fully labeled and clean, trust it to match online performance, and make it large enough to contain enough positive examples. It also becomes a problem if your data distribution drifts over time, in which case you will need to keep updating the calibration dataset with newer data.
  • Use online log data for calibration. This can solve data-discrepancy issues, but it has its own problems, especially for highly class-imbalanced problems: you may not have enough labeled positive examples to calibrate correctly. Another option is to calibrate against the number of flagged online examples rather than precision, and then use your offline calibration to back out precision from that flag rate (see the sketch below). This has the advantage of not needing real-time positive-class labels.
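
One way to read that last option as code: offline, sweep thresholds and record both the flag rate and the subsample-corrected precision; online, observe only the flag rate, with no labels needed, and look up the precision the offline curve associates with it. A hypothetical sketch:

```python
import numpy as np

def precision_from_online_flag_rate(offline_flag_rates: np.ndarray,
                                    offline_precisions: np.ndarray,
                                    online_flag_rate: float) -> float:
    """offline_flag_rates[i] and offline_precisions[i] come from the same
    threshold sweep on the corrected offline evaluation; the observed online
    flag rate picks out the closest point on that curve."""
    idx = int(np.argmin(np.abs(offline_flag_rates - online_flag_rate)))
    return float(offline_precisions[idx])
```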

Summary

Calibration is often one of the hardest parts of building an effective classification system, and using subsampled data only adds to the difficulty. This is especially true for classifying rare events where even slight mistakes in probability can lead to large numbers of false positives. Make sure, if nothing else, to have online measurement and sampling to ensure your system is classifying with the precision you expect.
