RL from Human Feedback

Discovering Language Model Behaviors with Model-Written Evaluations - Summary

The paper explores using language models (LMs) to automatically generate evaluations that test LM behaviors. The generated evaluations are diverse and of high quality, and the approach is significantly cheaper, faster, and less effortful than manual data creation. The paper discovers new cases