Improving Semantic Uncertainty Quantification in Language Models via Token-Level Temperature Scaling

1University of Oxford    2University of Toronto

Motivation

Motivation figure

Temperature Scaling Improves Semantic Uncertainty Quantification. We compare the same base model under different temperature parameters, generating ten responses per input and clustering the responses into semantic groups. We compute the semantic confidence measure introduced by Kuhn et al. (2023). Panel (a) uses the recommended temperature of 0.5; panel (b) uses a temperature optimised on a calibration set. Optimised temperature scaling offers a simple way to improve both semantic calibration and discrimination.

Key research question

How does optimising a single token-level temperature parameter affect the semantic calibration and discrimination of semantic confidence measures, and the discriminability of derived quantities such as semantic entropy?

Abstract

Calibration is central to reliable semantic uncertainty quantification in language models, yet prior work has largely focused on the discriminative use of semantic uncertainty, neglecting calibration. In this paper, we address this gap in the literature and study both semantic calibration and discrimination across a broad set of semantic confidence measures. We conduct a careful empirical evaluation and find that optimising a single, token-level temperature parameter is a simple and effective method for improving semantic uncertainty quantification. Across semantic confidence measures, models, and QA datasets, token-level temperature optimisation consistently improves semantic calibration, discrimination, and semantic entropy. Notably, uncertainty-focused temperature optimisation outperforms both widely-used fixed-temperature baselines and more sophisticated calibration methods for semantic uncertainty quantification.

Semantic Confidence Measures

Semantic confidence measures figure

Figure: How the existing E-SC (Farquhar et al., 2024) and L-SC (Kuhn et al., 2023) measures, as well as a broad set of novel baseline semantic measures (ML-SC, B-SC, T-SC, IC-SC, and G-SC), are computed. For an input \( \mathbf{x} \), we sample multiple responses from the model \( p(\cdot \mid \mathbf{x}) \), and use an NLI model to assess bidirectional entailment, determining whether responses \( \mathbf{y}^{i} \) and \( \mathbf{y}^{j} \) are semantically equivalent (\( \mathbf{y}^{i} \sim \mathbf{y}^{j} \mid \mathbf{x} \)). Here, \( s(C \mid \mathbf{x}) \) denotes the sum, \( \bar{s}(C \mid \mathbf{x}) \) the average, \( \mathcal{H}(p_{C_i}) \) the entropy, and \( \mathcal{E}(C_i \mid \mathbf{x}) \) the entropy of the length-normalised log-likelihoods of generations within cluster \( C_i \). See Section Confidence Measures for further details.
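The bidirectional-entailment clustering step in the figure can be sketched as a greedy grouping procedure. This is a minimal illustrative sketch, not the paper's implementation: the `entails` function is a hypothetical stand-in for a real NLI model call, replaced here with exact string matching purely so the example runs.

```python
def entails(a: str, b: str) -> bool:
    # Placeholder for an NLI model judgement (e.g. a DeBERTa model
    # fine-tuned on MNLI); exact match is used here only for illustration.
    return a.strip().lower() == b.strip().lower()

def cluster_responses(responses):
    """Greedily group responses into semantic-equivalence clusters.

    Two responses y_i, y_j land in the same cluster when the NLI model
    judges bidirectional entailment: y_i entails y_j AND y_j entails y_i.
    """
    clusters = []  # each cluster is a list of semantically equivalent responses
    for y in responses:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster representative
            if entails(y, rep) and entails(rep, y):
                cluster.append(y)
                break
        else:
            clusters.append([y])  # no match: start a new semantic cluster
    return clusters
```

Comparing only against a single representative per cluster keeps the number of NLI calls roughly linear in the number of clusters, at the cost of assuming entailment is (approximately) transitive within a cluster.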

Methodology

We evaluate the calibration and discrimination of semantic confidence measures across multiple question-answering datasets, including TriviaQA, Natural Questions, and SQuAD, using popular instruction-tuned language models. To investigate the effect of optimising temperature parameters (Temperature Scaling, TS) on semantic uncertainty quantification, we compare against several baseline post-hoc, token-level recalibration techniques: an Adaptive Temperature Scaling (ATS) head that predicts token-specific temperatures, Platt Scaling with a diagonal affine logit transform, and fixed-temperature baselines of τ = 1.0 (Base) and τ = 0.5 (SE). We compare how each method influences semantic calibration, discrimination, and uncertainty across semantic confidence measures.
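The core of TS is fitting a single scalar temperature to minimise negative log-likelihood on a held-out calibration set. A minimal sketch under simplifying assumptions (toy per-token logits rather than full model outputs; the function names and the golden-section search are ours, not the paper's):

```python
import math

def nll_at_temperature(logits_and_labels, tau):
    """Mean negative log-likelihood of the correct tokens when each
    token's logits are divided by a scalar temperature tau."""
    total = 0.0
    for logits, label in logits_and_labels:
        scaled = [z / tau for z in logits]
        m = max(scaled)  # log-sum-exp stabilisation
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += -(scaled[label] - log_z)
    return total / len(logits_and_labels)

def optimise_temperature(logits_and_labels, lo=0.05, hi=5.0, iters=60):
    """Golden-section search for the tau minimising calibration NLL
    (the NLL is unimodal in tau for fixed logits)."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if nll_at_temperature(logits_and_labels, c) < nll_at_temperature(logits_and_labels, d):
            b = d
        else:
            a = c
    return (a + b) / 2.0
```

For example, with three correctly answered and one incorrectly answered two-class token, the optimal temperature shrinks the model's confidence toward the empirical accuracy of 3/4.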

Results

We report results across models (7–8B Llama, Qwen, and Mistral models) and question-answering datasets (closed-book: TriviaQA and Natural Questions; open-book: SQuAD), focusing on how different token-level recalibration techniques impact semantic calibration and discrimination.

Temperature Scaling Improves Semantic Calibration and Discrimination

Optimised Temperature Scaling (TS) consistently improves semantic uncertainty quantification, outperforming both fixed-temperature heuristics (Base and SE) used in prior work, and more complex calibration methods such as Adaptive Temperature Scaling (ATS) and Platt Scaling. Improvements hold across all question-answering datasets, demonstrating that TS provides a simple, robust, and effective means of enhancing both semantic calibration and discrimination of semantic confidence measures.

Results figure 1: Semantic calibration and discrimination metrics

Figure: Uncertainty Metrics of SC Measures Across Methods. Mean and standard error of \( \widehat{\mathrm{ACE}} \) (\( \downarrow \)) and AUROC (\( \uparrow \)) scores for SC measures across baseline, calibration methods, and datasets. Closer to the top-left of plots indicates better discrimination and calibration, and hence better overall semantic uncertainty quantification.
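The AUROC reported in the figure measures how well a confidence score ranks correct above incorrect answers. A small illustrative sketch using the rank-based (Mann–Whitney U) formulation; the function name `auroc` is ours, not from the paper:

```python
def auroc(scores, labels):
    """AUROC of confidence scores against binary correctness labels.

    Equals the probability that a randomly chosen correct example
    receives a higher score than a randomly chosen incorrect one,
    with ties counted as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUROC of 1.0 means the confidence measure perfectly separates correct from incorrect answers; 0.5 is no better than chance.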

Temperature Scaling Improves Discriminability of the Semantic Entropy Derived from Semantic Confidence Measures

Building on the calibration and discrimination results above, we next evaluate how token-level temperature optimisation affects downstream semantic entropy (SE). We compare a principled formulation, \( \mathrm{SE}_{\mathrm{conf}} \), where the final answer is drawn from the most confident semantic cluster, against the heuristic baseline \( \mathrm{SE}_{\mathrm{vanilla}} \), which determines correctness via greedy decoding while sampling from a temperature-smoothed distribution. Across datasets, optimised temperature scaling (TS) consistently improves the discriminative power of entropy under both definitions, surpassing fixed-temperature heuristics (Base and SE with τ ∈ {0.5, 1.0}) and demonstrating that aligning prediction and uncertainty distributions yields more reliable semantic uncertainty estimates.
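Once responses are clustered, semantic entropy is the Shannon entropy of the distribution over semantic clusters. A minimal sketch assuming each cluster's likelihood has already been aggregated into a single log-probability (the aggregation choice, e.g. sum versus length-normalised average, is the SC measure's design decision; the function name is ours):

```python
import math

def semantic_entropy(cluster_logprobs):
    """Shannon entropy of the distribution over semantic clusters.

    Normalises per-cluster aggregated log-likelihoods into a
    probability distribution (stable softmax) and returns its entropy
    in nats: high when probability mass is spread across many
    semantically distinct answers, near zero when one cluster dominates.
    """
    m = max(cluster_logprobs)  # subtract the max for numerical stability
    weights = [math.exp(lp - m) for lp in cluster_logprobs]
    z = sum(weights)
    probs = [w / z for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

Two equally likely clusters give the maximum entropy log 2 ≈ 0.693; a single dominant cluster gives entropy near zero, signalling a semantically confident model.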

Results figure 2: Uncertainty measures and semantic entropy comparison

Figure: Discrimination Comparison of Entropy for Qwen. Mean and standard error of AUROC (\( \uparrow \)) values. (a) reports \( \mathrm{SE}_{\mathrm{conf}} \), where correctness is determined by the most confident semantic cluster under a given SC measure. (b) reports \( \mathrm{SE}_{\mathrm{vanilla}} \) from Kuhn et al. (2023), where correctness is determined via greedy decoding. Bold entries denote the best result within each SC measure per dataset, and underlined entries indicate the best overall per dataset.

Key Takeaways

  • Temperature scaling (TS) is a simple yet highly effective method for improving semantic uncertainty quantification in language models.
  • Optimising a single scalar temperature parameter substantially enhances both semantic calibration and discrimination across QA datasets.
  • TS consistently outperforms more complex methods such as Adaptive Temperature Scaling (ATS) and Platt Scaling.
  • Fixed-temperature baselines (τ = 1.0 and τ = 0.5) used in prior work are suboptimal for semantic calibration.
  • Better calibrated semantic confidence measures lead to more reliable downstream uncertainty metrics such as semantic entropy.
  • Overall, token-level temperature optimisation provides a simple, robust, and computationally efficient approach for improving the reliability of language models.

BibTeX

@article{lamb2025,
  title={Improving Semantic Uncertainty Quantification in Language Models via Token-Level Temperature Scaling},
  author={Lamb, T.A. and Ivanova, Desi and Torr, Philip H.S. and Rudner, Tim G.J.},
  journal={arXiv preprint},
  year={2025},
  archivePrefix={arXiv},
}