Conformal Prediction for LLMs
Applying conformal prediction to LLMs to obtain statistically rigorous confidence measures: guaranteed coverage with no distributional assumptions beyond exchangeability of the data.
The challenge
LLMs are notoriously overconfident, and their softmax probabilities are often poorly calibrated. How do you get a mathematically grounded answer to "how much should I trust this prediction?"
Approach
- Built on the MMLU benchmark: multiple-choice QA across dozens of academic subjects. Each question was sent to the LLM under all 24 permutations (4!) of its four answer choices to expose and mitigate positional bias.
- Log-probabilities were extracted via the Azure OpenAI API and aggregated across the permutations for robust scoring.
- 30-fold stratified cross-validation compared conformal prediction sets against a naive calibrated-threshold baseline at 40 alpha (miscoverage) levels from 0.01 to 0.5.
- Conformal prediction wraps model outputs in prediction sets with a distribution-free coverage guarantee: at miscoverage level alpha, the true answer is in the set at least a 1 - alpha fraction of the time.
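To make the prediction-set idea concrete, here is a minimal sketch of split conformal prediction for multiple-choice answers. It is not the project's code. It assumes the nonconformity score 1 - p(true answer), a single calibration/test split instead of 30-fold cross-validation, and synthetic probabilities in place of real LLM outputs. The names conformal_threshold, prediction_set, and synthetic_item are hypothetical.

```python
import math
import random

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Threshold q_hat on the score 1 - p(true answer) for miscoverage alpha."""
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    # If that rank exceeds n, no finite threshold suffices: include everything.
    return scores[k - 1] if k <= n else float("inf")

def prediction_set(probs, q_hat):
    """Indices of the answer choices whose score 1 - p is within q_hat."""
    return [i for i, p in enumerate(probs) if 1.0 - p <= q_hat]

def synthetic_item(rng, n_choices=4):
    """Assumption: stand-in for one question (fake probabilities + a label)."""
    logits = [rng.gauss(0.0, 2.0) for _ in range(n_choices)]
    z = sum(math.exp(v) for v in logits)
    probs = [math.exp(v) / z for v in logits]
    # The label is drawn from the same probabilities, purely for illustration.
    label = rng.choices(range(n_choices), weights=probs)[0]
    return probs, label

rng = random.Random(0)
data = [synthetic_item(rng) for _ in range(1000)]
cal, test = data[:500], data[500:]

alpha = 0.1
q_hat = conformal_threshold([p for p, _ in cal], [y for _, y in cal], alpha)
sets = [prediction_set(p, q_hat) for p, _ in test]

# Empirical coverage: how often the true label falls inside its set.
coverage = sum(y in s for s, (_, y) in zip(sets, test)) / len(test)
avg_size = sum(len(s) for s in sets) / len(test)
print(f"coverage={coverage:.3f} (target >= {1 - alpha:.2f}), "
      f"mean set size={avg_size:.2f}")
```

The guarantee is marginal: it holds on average over random calibration/test draws, so a single split can land slightly below the target. With this score a set can also be empty when no choice is confident enough, which is one reason set size is usually reported alongside coverage.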
Key takeaway
This work was conducted at Statistics Canada. Conformal prediction provides finite-sample coverage guarantees that softmax calibration alone cannot. The permutation approach also revealed how sensitive LLMs are to answer ordering, a bias that stays invisible without systematic testing.
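The ordering sensitivity can be illustrated with a small sketch of permutation aggregation. This is an assumption-laden toy, not the project's pipeline: score_fn is a hypothetical stand-in for an LLM call that returns one log-probability per answer slot, probabilities are aggregated by simple averaging (the source does not specify the aggregation rule), and toy_score_fn is an invented model with a strong first-slot bias.

```python
import math
from itertools import permutations

def softmax(logps):
    """Normalize log-scores into probabilities (shifted for stability)."""
    m = max(logps)
    exps = [math.exp(v - m) for v in logps]
    z = sum(exps)
    return [e / z for e in exps]

def aggregate_over_permutations(score_fn, n_choices=4):
    """Average per-choice probabilities over every ordering of the choices.

    score_fn(order) is hypothetical: order[j] is the original choice shown
    in slot j, and the return value is one log-probability per slot.
    """
    orders = list(permutations(range(n_choices)))  # 24 orderings for 4 choices
    totals = [0.0] * n_choices
    top_choices = []
    for order in orders:
        slot_probs = softmax(score_fn(order))
        # Map slot probabilities back to the original choice identities.
        for slot, choice in enumerate(order):
            totals[choice] += slot_probs[slot]
        top_slot = max(range(n_choices), key=slot_probs.__getitem__)
        top_choices.append(order[top_slot])
    probs = [t / len(orders) for t in totals]
    winner = max(range(n_choices), key=probs.__getitem__)
    # Share of orderings whose own top answer differs from the aggregate.
    flip_rate = sum(c != winner for c in top_choices) / len(orders)
    return probs, winner, flip_rate

def toy_score_fn(order):
    """Invented model: mild preference for choice 2, strong first-slot bias."""
    content = [0.0, 0.0, 1.0, 0.0]
    slot_bias = [1.5, 0.0, 0.0, 0.0]
    return [content[c] + slot_bias[s] for s, c in enumerate(order)]

probs, winner, flip_rate = aggregate_over_permutations(toy_score_fn)
print("aggregated probabilities:", [round(p, 3) for p in probs])
print("aggregate winner:", winner, "flip rate:", flip_rate)
```

In this toy setup the averaged probabilities favor choice 2, even though any single ordering that does not place choice 2 first picks whatever sits in the first slot. A single-ordering evaluation would therefore hide the bias that averaging over all orderings removes.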