--
https://arxiv.org/pdf/2402.05201.pdf - 07 Feb 2024
https://github.com/matthewrenze/jhu-llm-temperature
What is the Subject:
This paper examines the effect of **sampling temperature** on the performance of Large Language Models (LLMs) on problem-solving tasks. As the paper describes it, sampling temperature controls the randomness and creativity of the LLM’s output. The authors investigate whether adjusting this parameter improves the LLM’s ability to solve problems.
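For context on what the parameter does mechanically (background, not material from the paper): temperature divides the model’s logits before the softmax, so low values concentrate probability on the top token while high values flatten the distribution. A minimal sketch in Python:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float,
                            rng=np.random.default_rng()) -> int:
    """Sample a token index from logits scaled by the sampling temperature.

    temperature -> 0 approaches greedy (argmax) decoding;
    higher temperatures flatten the distribution, increasing randomness.
    """
    if temperature <= 0:
        return int(np.argmax(logits))          # greedy decoding at T = 0
    scaled = logits / temperature              # temperature scaling
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# The same logits become near-deterministic at low T and near-uniform at high T.
logits = np.array([2.0, 1.0, 0.5, 0.1])
for t in (0.0, 0.7, 1.0, 1.4):
    print(t, [sample_with_temperature(logits, t) for _ in range(10)])
```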
What Were the Goals:
The primary aim was to answer the open question of which sampling temperature is optimal for LLM problem-solving. The authors wanted to move beyond anecdotal evidence and provide a systematic, empirical study to guide best practices in LLM and prompt engineering.
### Their hypotheses were:
* Null Hypothesis (H0): Sampling temperature (within the 0.0 to 1.0 range) has no effect on problem-solving performance.
* Alternative Hypothesis (H1): Adjusting sampling temperature improves problem-solving performance.
How Was It Done:
* Data: They created Multiple-Choice Question-and-Answer (MCQA) exams from established LLM benchmarks, covering diverse problem domains like math, science, and law.
* Models: Four popular LLMs were used: GPT-3.5, GPT-4, Llama 2 7B, and Llama 2 70B.
* Prompts: Five prompt-engineering techniques were employed: Baseline, Domain Expertise, Self-Recitation, Chain-of-Thought (CoT), and a Composite of the techniques.
* Metrics: The primary metric was correct-answer accuracy. Additionally, various text-similarity metrics were used to assess the variability of LLM responses.
* Analysis: They analyzed accuracy across different temperatures and ran a Kruskal-Wallis test to check for statistical significance (a rough sketch of this workflow follows the list).
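A rough sketch of how such a sweep might look, assuming the OpenAI Python client and SciPy; the exam items, prompt wording, model id, and trial counts below are illustrative placeholders rather than the paper’s actual materials:

```python
# Hypothetical re-creation of the evaluation loop: sweep temperature over an
# MCQA exam, score accuracy per trial, then test for significance.
from openai import OpenAI
from scipy.stats import kruskal

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXAM = [  # toy MCQA items; the paper drew items from established benchmarks
    {"question": "What is 7 * 8?",
     "choices": {"A": "54", "B": "56", "C": "64", "D": "58"}, "answer": "B"},
    {"question": "At sea level, water boils at?",
     "choices": {"A": "90 C", "B": "100 C", "C": "110 C", "D": "120 C"}, "answer": "B"},
]

def ask(item, temperature: float) -> str:
    """Ask one MCQA item and return the model's single-letter answer."""
    options = "\n".join(f"{k}) {v}" for k, v in item["choices"].items())
    prompt = f"{item['question']}\n{options}\nAnswer with a single letter."
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",        # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=1,
    )
    return resp.choices[0].message.content.strip().upper()[:1]

def exam_accuracy(temperature: float) -> float:
    """Fraction of exam items answered correctly at a given temperature."""
    correct = sum(ask(item, temperature) == item["answer"] for item in EXAM)
    return correct / len(EXAM)

temperatures = [0.0, 0.25, 0.5, 0.75, 1.0]
trials = 10
scores = {t: [exam_accuracy(t) for _ in range(trials)] for t in temperatures}

# Kruskal-Wallis H-test: do the accuracy distributions differ across temperatures?
h_stat, p_value = kruskal(*scores.values())
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")
# A large p-value (e.g. > 0.05) fails to reject H0: temperature has no effect on accuracy.
```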
What Were the Results:
* Accuracy remained stable across sampling temperatures (0.0 to 1.0) for all LLMs and prompt types. The Kruskal-Wallis test showed no statistically significant difference, failing to reject the null hypothesis.
* Text variability increased with higher temperatures, as expected, confirming the temperature’s effect on randomness and creativity (see the similarity sketch after this list).
* Beyond a temperature of 1.0, performance decreased significantly, dropping to random-chance accuracy at a temperature of 1.4.
* The Llama 2 models performed poorly, producing formatting errors and incorrect answers, so they were excluded from the detailed analysis.
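One simple way to quantify that variability (the paper uses several text-similarity metrics; this sketch substitutes Python’s `difflib` as a stand-in) is to compare responses generated at the same temperature pairwise:

```python
# Illustrative variability check: generate several responses per temperature and
# measure how similar they are to one another. difflib's ratio is a stand-in for
# the text-similarity metrics used in the paper.
from difflib import SequenceMatcher
from itertools import combinations

def mean_pairwise_similarity(responses: list[str]) -> float:
    """Average character-level similarity over all pairs of responses (1.0 = identical)."""
    pairs = list(combinations(responses, 2))
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def variability_at(generate, prompt: str, temperature: float, n: int = 5) -> float:
    """`generate(prompt, temperature)` is a placeholder for any LLM call."""
    responses = [generate(prompt, temperature) for _ in range(n)]
    return 1.0 - mean_pairwise_similarity(responses)  # higher = more variable output

# Expected pattern from the paper: variability rises as temperature increases,
# even though MCQA accuracy stays flat between 0.0 and 1.0.
```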
Conclusion
The research suggests that adjusting the sampling temperature within the 0.0 to 1.0 range does not significantly affect LLM problem-solving performance on MCQA tasks. This finding is valuable for AI engineers: it can save the time and compute spent tuning temperature and let them focus on other optimization techniques.
However, the authors acknowledge limitations and propose further research with broader problem sets, open-ended tasks, and additional LLMs to determine how far the result generalizes.