Arxiv URL: https://arxiv.org/abs/2203.02155
Authors: Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe
The paper presents a method for aligning language models with user intent by fine-tuning them with human feedback. The resulting models, called InstructGPT, show improvements in truthfulness and reductions in toxic output generation, with minimal performance regressions on public NLP datasets. The paper also argues that the standard language modeling objective (predicting the next token) is misaligned with what users actually want, and that language models should instead be trained to act in accordance with the user's intention.
Key Insights & Learnings:
- Fine-tuning language models with human feedback is a promising direction for aligning language models with human intent.
- Labelers preferred outputs from the 1.3B-parameter InstructGPT model over outputs from the 175B-parameter GPT-3, even though the InstructGPT model has over 100x fewer parameters.
- InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets.
- The language modeling objective is misaligned with following the user's instructions helpfully and safely.
- Public NLP datasets are not reflective of how language models are used.
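The human-feedback fine-tuning summarized above depends on labelers comparing model outputs. As a rough illustration only, the sketch below shows the kind of pairwise preference loss commonly used to train a reward model from such comparisons: the loss is small when the output the labeler preferred gets the higher score, and large when it gets the lower one. The function name `pairwise_preference_loss` and the toy scores are hypothetical and are not taken from the paper's code.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_preferred - score_rejected)).

    Both arguments are scalar scores that a (hypothetical) reward model
    assigned to two outputs for the same prompt. The first belongs to the
    output the labeler preferred.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)). Branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Toy scores, assumed for illustration only.
    agrees = pairwise_preference_loss(2.0, 0.5)     # model ranks the preferred output higher
    disagrees = pairwise_preference_loss(0.5, 2.0)  # model ranks the preferred output lower
    print(f"loss when the model agrees with the labeler:    {agrees:.4f}")
    print(f"loss when the model disagrees with the labeler: {disagrees:.4f}")
```

Averaging this loss over many labeled comparisons gives a training signal for the reward model; a policy can then be optimized against the learned reward, for example with PPO.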
Terms Mentioned: language models, fine-tuning, human feedback, InstructGPT, truthfulness, toxic output generation, NLP datasets
Technologies / Libraries Mentioned: GPT-3, OpenAI API, PPO algorithm, RealToxicityPrompts dataset