Discovering Language Model Behaviors with Model-Written Evaluations - Summary
The paper explores the use of language models (LMs) to automatically generate evaluations for testing LM behaviors. The generated evaluations are diverse and of high quality, and the approach is significantly cheaper, lower effort, and faster than manual data creation. The paper discovers new cases