Chain-of-Thought Prompting Elicits Reasoning in Large Language Models - Summary

Arxiv URL: https://arxiv.org/abs/2201.11903

Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou

Summary:

The paper explores how generating a chain of thought (a series of intermediate reasoning steps) can improve the ability of large language models to perform complex reasoning. The authors introduce a simple method called chain-of-thought prompting, in which a few chain-of-thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.
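
To make the method concrete, here is a minimal sketch of how a chain-of-thought prompt is assembled. The exemplar question and worked answer come from Figure 1 of the paper; the build_cot_prompt helper and the generate call are hypothetical stand-ins for whatever completion API is used.

```python
# Minimal sketch of chain-of-thought prompting.
# The exemplar is the worked example from Figure 1 of the paper;
# build_cot_prompt and generate() are hypothetical stand-ins.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)

def build_cot_prompt(question: str, exemplars: list[str]) -> str:
    """Prepend worked examples so the model imitates step-by-step reasoning."""
    return "\n\n".join(exemplars) + f"\n\nQ: {question}\nA:"

prompt = build_cot_prompt(
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
    [COT_EXEMPLAR],
)
# answer = generate(prompt)  # model emits reasoning steps, then the final answer
```

Because the exemplar's answer walks through intermediate steps, the model tends to produce similar step-by-step reasoning for the new question before committing to a final answer.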

Key Insights & Learnings:

  • Chain-of-thought prompting significantly improves the ability of large language models to perform complex reasoning.
  • Reasoning abilities emerge naturally in sufficiently large language models via chain-of-thought prompting.
  • Chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.
  • Prompting a PaLM 540B with just eight chain-of-thought exemplars achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.
  • Chain-of-thought prompting is a promising prompt engineering approach that facilitates reasoning, provides interpretability, and is potentially applicable to any task that humans can solve via language.
Figure: Difference between standard prompting and chain-of-thought prompting
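
In the figure, the two prompt styles differ only in what the exemplar's answer contains. A minimal sketch of the contrast, using the Figure 1 exemplar (the exact string formatting is an assumption):

```python
# Standard prompting: the in-context exemplar maps a question straight to its answer.
standard_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: The answer is 11."
)

# Chain-of-thought prompting: the same exemplar, but the answer spells out
# the intermediate reasoning before the final answer.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)
```

With standard prompting the model learns to answer directly; with chain-of-thought prompting it learns to reason step by step first, which is what drives the gains described below.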

What are the advantages?

  • Requires no model fine-tuning - works with off-the-shelf language models
  • Provides interpretable reasoning steps that show how the model reached its answer
  • Generalizes across different types of reasoning tasks
  • Enables models to adapt computation to problem complexity

Empirical results

  • Tested on multiple models (PaLM, LaMDA, GPT-3) and multiple scales
  • Showed consistent improvements over standard prompting
  • Performance gains were largest on complex multi-step problems
  • Facilitated out-of-distribution generalization to longer problem instances on symbolic reasoning tasks

Limitations

  • Gains emerge only at sufficiently large model scale (roughly 100B parameters); smaller models produce fluent but illogical chains of thought
  • No guarantee of correct reasoning paths; the model can reach a right answer through flawed intermediate steps
  • Performance is sensitive to prompt engineering, i.e. the choice and phrasing of the chain-of-thought exemplars


Terms Mentioned: Chain-of-thought prompting, Large language models, Arithmetic reasoning, Commonsense reasoning, Symbolic reasoning, Few-shot prompting, Math word problems, PaLM 540B, GSM8K benchmark, GPT-3

Organizations / Venues Mentioned: Google Research, Neural Information Processing Systems (NeurIPS), arXiv