Improving Semantic Uncertainty Quantification in Language Models via Token-Level Temperature Scaling

1University of Oxford    2University of Toronto

Motivation

Motivation figure

Temperature Scaling Improves Semantic Uncertainty Quantification. We compare the same base model under different temperature parameters, generating ten responses per input and clustering the responses into semantic groups. We compute the semantic confidence measure introduced by Kuhn et al. (2023). Panel (a) uses the recommended temperature of 0.5; panel (b) uses a temperature optimised on a calibration set. Optimised temperature scaling offers a simple way to improve both semantic calibration and discrimination.

Key research question

How does optimising a single token-level temperature parameter affect the semantic calibration and discrimination of semantic confidence measures, and the discriminability of derived quantities such as semantic entropy?

Abstract

Calibration is central to reliable semantic uncertainty quantification in language models, yet prior work has largely focused on the discriminative use of semantic uncertainty, neglecting calibration. In this paper, we address this gap in the literature and study both semantic calibration and discrimination across a broad set of semantic confidence measures. We conduct a careful empirical evaluation and find that optimising a single, token-level temperature parameter is a simple and effective method for improving semantic uncertainty quantification. Across semantic confidence measures, models, and QA datasets, token-level temperature optimisation consistently improves semantic calibration, discrimination, and semantic entropy. Notably, uncertainty-focused temperature optimisation outperforms both widely-used fixed-temperature baselines and more sophisticated calibration methods for semantic uncertainty quantification.

Semantic Confidence Measures

Semantic confidence measures figure

Figure: How the existing E-SC (Farquhar et al., 2024) and L-SC (Kuhn et al., 2023) measures, as well as a broad set of novel baseline semantic measures (ML-SC, B-SC, T-SC, IC-SC, and G-SC), are computed. For an input \( \mathbf{x} \), we sample multiple responses from the model \( p(\cdot \mid \mathbf{x}) \), and use an NLI model to assess bidirectional entailment, determining whether responses \( \mathbf{y}^{i} \) and \( \mathbf{y}^{j} \) are semantically equivalent (\( \mathbf{y}^{i} \sim \mathbf{y}^{j} \mid \mathbf{x} \)). Here, \( s(C \mid \mathbf{x}) \) denotes the sum, \( \bar{s}(C \mid \mathbf{x}) \) the average, \( \mathcal{H}(p_{C_i}) \) the entropy, and \( \mathcal{E}(C_i \mid \mathbf{x}) \) the entropy of the length-normalised log-likelihoods of generations within cluster \( C_i \). See Section Confidence Measures for further details.
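The bidirectional-entailment clustering step in the figure can be sketched as a greedy grouping procedure. This is a minimal illustrative sketch, not the paper's implementation: the `entails` function is a hypothetical stand-in for a real NLI model call, replaced here with exact string matching purely so the example runs.

```python
def entails(a: str, b: str) -> bool:
    # Placeholder for an NLI model judgement (e.g. a DeBERTa model
    # fine-tuned on MNLI); exact match is used here only for illustration.
    return a.strip().lower() == b.strip().lower()

def cluster_responses(responses):
    """Greedily group responses into semantic-equivalence clusters.

    Two responses y_i, y_j land in the same cluster when the NLI model
    judges bidirectional entailment: y_i entails y_j AND y_j entails y_i.
    """
    clusters = []  # each cluster is a list of semantically equivalent responses
    for y in responses:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster representative
            if entails(y, rep) and entails(rep, y):
                cluster.append(y)
                break
        else:
            clusters.append([y])  # no match: start a new semantic cluster
    return clusters
```

Comparing only against a single representative per cluster keeps the number of NLI calls roughly linear in the number of clusters, at the cost of assuming entailment is (approximately) transitive within a cluster.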

Methodology

We evaluate the calibration and discrimination of semantic confidence measures across multiple question-answering datasets, including TriviaQA, Natural Questions, and SQuAD, using popular instruction-tuned language models. To investigate the effect of optimising temperature parameters (Temperature Scaling, TS) on semantic uncertainty quantification, we compare against several baseline post-hoc, token-level recalibration techniques: an Adaptive Temperature Scaling (ATS) head that predicts token-specific temperatures, Platt Scaling with a diagonal affine logit transform, and fixed-temperature baselines of τ = 1.0 (Base) and τ = 0.5 (SE). We compare how each method influences semantic calibration, discrimination, and uncertainty across semantic confidence measures.
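The core of TS is fitting a single scalar temperature to minimise negative log-likelihood on a held-out calibration set. A minimal sketch under simplifying assumptions (toy per-token logits rather than full model outputs; the function names and the golden-section search are ours, not the paper's):

```python
import math

def nll_at_temperature(logits_and_labels, tau):
    """Mean negative log-likelihood of the correct tokens when each
    token's logits are divided by a scalar temperature tau."""
    total = 0.0
    for logits, label in logits_and_labels:
        scaled = [z / tau for z in logits]
        m = max(scaled)  # log-sum-exp stabilisation
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += -(scaled[label] - log_z)
    return total / len(logits_and_labels)

def optimise_temperature(logits_and_labels, lo=0.05, hi=5.0, iters=60):
    """Golden-section search for the tau minimising calibration NLL
    (the NLL is unimodal in tau for fixed logits)."""
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if nll_at_temperature(logits_and_labels, c) < nll_at_temperature(logits_and_labels, d):
            b = d
        else:
            a = c
    return (a + b) / 2.0
```

For example, with three correctly answered and one incorrectly answered two-class token, the optimal temperature shrinks the model's confidence toward the empirical accuracy of 3/4.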

Results

We report results across models (7–8B Llama, Qwen, and Mistral models) and question-answering datasets (closed-book: TriviaQA and Natural Questions; open-book: SQuAD), focusing on how different token-level recalibration techniques impact semantic calibration and discrimination.

Temperature Scaling Improves Semantic Calibration and Discrimination

Optimised Temperature Scaling (TS) consistently improves semantic uncertainty quantification, outperforming both fixed-temperature heuristics (Base and SE) used in prior work, and more complex calibration methods such as Adaptive Temperature Scaling (ATS) and Platt Scaling. Improvements hold across all question-answering datasets, demonstrating that TS provides a simple, robust, and effective means of enhancing both semantic calibration and discrimination of semantic confidence measures.

Results figure 1: Semantic calibration and discrimination metrics

Figure: Uncertainty Metrics of SC Measures Across Methods. Mean and standard error of \( \widehat{\mathrm{ACE}} \) (\( \downarrow \)) and AUROC (\( \uparrow \)) scores for SC measures across baseline, calibration methods, and datasets. Closer to the top-left of plots indicates better discrimination and calibration, and hence better overall semantic uncertainty quantification.
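The AUROC reported in the figure measures how well a confidence score ranks correct above incorrect answers. A small illustrative sketch using the rank-based (Mann–Whitney U) formulation; the function name `auroc` is ours, not from the paper:

```python
def auroc(scores, labels):
    """AUROC of confidence scores against binary correctness labels.

    Equals the probability that a randomly chosen correct example
    receives a higher score than a randomly chosen incorrect one,
    with ties counted as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUROC of 1.0 means the confidence measure perfectly separates correct from incorrect answers; 0.5 is no better than chance.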

Temperature Scaling Improves Discriminability of the Semantic Entropy Derived from Semantic Confidence Measures

Building on the calibration and discrimination results above, we next evaluate how token-level temperature optimisation affects downstream semantic entropy (SE). We compare a principled formulation, \( \mathrm{SE}_{\mathrm{conf}} \), where the final answer is drawn from the most confident semantic cluster, against the heuristic baseline \( \mathrm{SE}_{\mathrm{vanilla}} \), which determines correctness via greedy decoding while sampling from a temperature-smoothed distribution. Across datasets, optimised temperature scaling (TS) consistently improves the discriminative power of entropy under both definitions, surpassing fixed-temperature heuristics (Base and SE with τ ∈ {0.5, 1.0}) and demonstrating that aligning prediction and uncertainty distributions yields more reliable semantic uncertainty estimates.
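Once responses are clustered, semantic entropy is the Shannon entropy of the distribution over semantic clusters. A minimal sketch assuming each cluster's likelihood has already been aggregated into a single log-probability (the aggregation choice, e.g. sum versus length-normalised average, is the SC measure's design decision; the function name is ours):

```python
import math

def semantic_entropy(cluster_logprobs):
    """Shannon entropy of the distribution over semantic clusters.

    Normalises per-cluster aggregated log-likelihoods into a
    probability distribution (stable softmax) and returns its entropy
    in nats: high when probability mass is spread across many
    semantically distinct answers, near zero when one cluster dominates.
    """
    m = max(cluster_logprobs)  # subtract the max for numerical stability
    weights = [math.exp(lp - m) for lp in cluster_logprobs]
    z = sum(weights)
    probs = [w / z for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

Two equally likely clusters give the maximum entropy log 2 ≈ 0.693; a single dominant cluster gives entropy near zero, signalling a semantically confident model.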

Results figure 2: Uncertainty measures and semantic entropy comparison

Figure: Discrimination Comparison of Entropy for Qwen. Mean and standard error of AUROC (\( \uparrow \)) values. (a) reports \( \mathrm{SE}_{\mathrm{conf}} \), where correctness is determined by the most confident semantic cluster under a given SC measure. (b) reports \( \mathrm{SE}_{\mathrm{vanilla}} \) from Kuhn et al. (2023), where correctness is determined via greedy decoding. Bold entries denote the best result within each SC measure per dataset, and underlined entries indicate the best overall per dataset.

Key Takeaways

  • Temperature scaling (TS) is a simple yet highly effective method for improving semantic uncertainty quantification in language models.
  • Optimising a single scalar temperature parameter substantially enhances both semantic calibration and discrimination across QA datasets.
  • TS consistently outperforms more complex methods such as Adaptive Temperature Scaling (ATS) and Platt Scaling.
  • Fixed-temperature baselines (τ = 1.0 and τ = 0.5) used in prior work are suboptimal for semantic calibration.
  • Better calibrated semantic confidence measures lead to more reliable downstream uncertainty metrics such as semantic entropy.
  • Overall, token-level temperature optimisation provides a simple, robust, and computationally efficient approach for improving the reliability of language models.

BibTeX

@article{lamb2025,
  title={Improving Semantic Uncertainty Quantification in Language Models via Token-Level Temperature Scaling},
  author={Lamb, T.A. and Ivanova, Desi and Torr, Philip H.S. and Rudner, Tim G.J.},
  journal={arXiv preprint},
  year={2025},
  archivePrefix={arXiv},
}