Conformal Prediction for LLMs

Applying conformal prediction to LLMs for statistically rigorous confidence measures — guaranteed coverage without distributional assumptions.

Python · Statistics · LLMs

The challenge

LLMs are notoriously overconfident, and their softmax probabilities are often poorly calibrated. How do you get a mathematically grounded answer to "how much should I trust this prediction?"

Approach

Built on the MMLU benchmark: multiple-choice QA across dozens of academic subjects. All 24 permutations (4!) of the four answer choices were sent to the LLM to expose and mitigate positional bias.
Log-probabilities were extracted via the Azure OpenAI API and aggregated across permutations for robust scoring.
30-fold stratified cross-validation compared conformal prediction sets against a naive calibrated-threshold baseline across 40 alpha levels (0.01–0.5).
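Conformal prediction wraps outputs in prediction sets with guaranteed marginal coverage: for a chosen alpha, "the true answer is in this set at least 100(1 − alpha)% of the time". The guarantee is distribution-free, requiring only that calibration and test questions are exchangeable.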
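
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names are hypothetical, averaging renormalised probabilities is only one plausible way to aggregate across permutations, and the nonconformity score 1 − p(true answer) is a common choice that the project may or may not have used. The split-conformal threshold assumes calibration and test questions are exchangeable.

```python
import itertools
import math

import numpy as np

N_CHOICES = 4
# All 4! = 24 orderings; PERMS[p][j] is the original answer shown in position j.
PERMS = list(itertools.permutations(range(N_CHOICES)))


def aggregate_over_permutations(logprobs_by_perm):
    """Average per-answer probabilities over every answer ordering.

    logprobs_by_perm[p][j] is the log-probability the model assigned to
    position j when the answers were shown in ordering PERMS[p].
    (Assumed input layout, for illustration only.)
    """
    probs = np.zeros(N_CHOICES)
    for perm, logprobs in zip(PERMS, logprobs_by_perm):
        p = np.exp(np.asarray(logprobs, dtype=float))
        p = p / p.sum()  # renormalise over the four answer positions
        for position, original_idx in enumerate(perm):
            probs[original_idx] += p[position]  # map position back to answer
    return probs / len(PERMS)


def conformal_threshold(cal_probs, cal_labels, alpha):
    """Split-conformal quantile of the scores 1 - p(true answer)."""
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return math.inf  # too few calibration points: keep every answer
    return float(np.sort(scores)[k - 1])


def prediction_set(probs, qhat):
    """Indices of all answers whose score 1 - p is within the threshold."""
    return [i for i, p in enumerate(probs) if 1.0 - p <= qhat]
```

In use, the aggregated probabilities of a held-out calibration fold would be passed to conformal_threshold together with the true answer indices, and the resulting threshold applied to each test question with prediction_set. A lower alpha gives a larger threshold and therefore larger sets; a purely positional preference (the model always favouring the first slot) averages out to a uniform distribution under this aggregation.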

Key takeaway

Conducted at Statistics Canada. Conformal prediction provides coverage guarantees that softmax calibration cannot. The permutation approach revealed how sensitive LLMs are to answer ordering — a bias invisible without systematic testing.

Source on GitHub →