Chain-of-Thought Prompting Elicits Reasoning in Large Language Models - Summary

Arxiv URL: https://arxiv.org/abs/2201.11903

Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou

Summary:

The paper explores how generating a chain of thought (a series of intermediate reasoning steps) can improve the ability of large language models to perform complex reasoning. The authors introduce a simple method called chain-of-thought prompting, in which a few chain-of-thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.
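
To make the method concrete, here is a minimal sketch of how a chain-of-thought prompt is assembled. The exemplar question and worked answer come from Figure 1 of the paper; the build_cot_prompt helper and the generate call are hypothetical stand-ins for whatever completion API is used.

```python
# Minimal sketch of chain-of-thought prompting.
# The exemplar is the worked example from Figure 1 of the paper;
# build_cot_prompt and generate() are hypothetical stand-ins.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)

def build_cot_prompt(question: str, exemplars: list[str]) -> str:
    """Prepend worked examples so the model imitates step-by-step reasoning."""
    return "\n\n".join(exemplars) + f"\n\nQ: {question}\nA:"

prompt = build_cot_prompt(
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
    [COT_EXEMPLAR],
)
# answer = generate(prompt)  # model emits reasoning steps, then the final answer
```

Because the exemplar's answer walks through intermediate steps, the model tends to produce similar step-by-step reasoning for the new question before committing to a final answer.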

Key Insights & Learnings:

  • Chain-of-thought prompting significantly improves the ability of large language models to perform complex reasoning.
  • Reasoning abilities emerge naturally in sufficiently large language models via chain-of-thought prompting.
  • Chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.
  • Prompting a PaLM 540B with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
  • Chain-of-thought prompting is a promising prompt engineering approach that facilitates reasoning, provides interpretability, and is potentially applicable to any task that humans can solve via language.
Figure: Difference between standard prompting and chain-of-thought prompting
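
In the figure, the two prompt styles differ only in what the exemplar's answer contains. A minimal sketch of the contrast, using the Figure 1 exemplar (the exact string formatting is an assumption):

```python
# Standard prompting: the in-context exemplar maps a question straight to its answer.
standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11."
)

# Chain-of-thought prompting: the same exemplar, but the answer spells out
# the intermediate reasoning before the final answer.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)
```

With standard prompting the model learns to answer directly; with chain-of-thought prompting it learns to reason step by step first, which is what drives the gains described below.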

What are the advantages?

  • Requires no model fine-tuning - works with off-the-shelf language models
  • Provides interpretable reasoning steps that show how the model reached its answer
  • Generalizes across different types of reasoning tasks
  • Enables models to adapt computation to problem complexity

Empirical results

  • Tested on multiple models (PaLM, LaMDA, GPT-3) and multiple scales
  • Showed consistent improvements over standard prompting
  • Performance gains were largest on complex multi-step problems
  • Facilitated out-of-distribution generalization to longer problem instances on symbolic reasoning tasks

Limitations

  • Gains emerge only at sufficiently large model scale (roughly 100B parameters); smaller models produce fluent but illogical chains of thought
  • No guarantee of correct reasoning paths; the model can reach a right answer through flawed intermediate steps
  • Performance is sensitive to prompt engineering, i.e. the choice and phrasing of the chain-of-thought exemplars


Terms Mentioned: Chain-of-thought prompting, Large language models, Arithmetic reasoning, Commonsense reasoning, Symbolic reasoning, Few-shot prompting, Math word problems, PaLM 540B, GSM8K benchmark, GPT-3

Organizations / Venues Mentioned: Google Research, Neural Information Processing Systems (NeurIPS), arXiv