Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes - Summary

Arxiv URL: https://arxiv.org/abs/2305.02301

Authors: Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

Summary:

The paper introduces Distilling step-by-step, a training mechanism that produces small task-specific models which outperform large language models (LLMs) while requiring less training data and far smaller model sizes. The mechanism prompts an LLM to produce rationales alongside its predictions and uses those rationales as additional supervision for the small model within a multi-task training setup, where the model learns both to predict labels and to generate rationales. Across 4 NLP benchmarks, the paper reports three findings: (1) compared to both standard fine-tuning and distillation, Distilling step-by-step achieves better performance with far fewer labeled or unlabeled training examples; (2) compared to few-shot prompted LLMs, it reaches better performance with substantially smaller models; and (3) it reduces both the model size and the amount of data required to outperform LLMs.
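
To make the multi-task setup concrete, the sketch below trains a small T5 model on two objectives, label prediction and rationale generation, distinguished by task prefixes. This is a minimal sketch assuming a HuggingFace T5 model; the prefix strings, example text, and the rationale weight lam are illustrative assumptions rather than the paper's exact configuration.

    # Minimal sketch of the multi-task objective described above, assuming a
    # HuggingFace T5 model. Task prefixes, example text, and `lam` are
    # illustrative assumptions, not the paper's exact hyperparameters.
    import torch
    from transformers import T5ForConditionalGeneration, T5TokenizerFast

    tokenizer = T5TokenizerFast.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    def multitask_loss(question: str, label: str, rationale: str, lam: float = 1.0):
        """Label-prediction loss plus a weighted rationale-generation loss."""
        losses = []
        for prefix, target in (("[label] ", label), ("[rationale] ", rationale)):
            inputs = tokenizer(prefix + question, return_tensors="pt", truncation=True)
            targets = tokenizer(target, return_tensors="pt", truncation=True)
            out = model(**inputs, labels=targets.input_ids)
            losses.append(out.loss)
        return losses[0] + lam * losses[1]

    # One illustrative training step (the example text is made up):
    loss = multitask_loss(
        question="Is 17 a prime number?",
        label="yes",
        rationale="17 has no divisors other than 1 and itself, so it is prime.",
    )
    loss.backward()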

Key Insights & Learnings:

  • Distilling step-by-step trains small task-specific models that outperform LLMs while requiring less training data and far smaller model sizes.
  • Rationales elicited from an LLM (e.g., via chain-of-thought prompting) serve as additional supervision for the small model within a multi-task training setup (a prompt sketch follows this list).
  • Compared to both standard fine-tuning and distillation, Distilling step-by-step reaches better performance with far fewer labeled or unlabeled training examples.
  • Compared to few-shot prompted LLMs, the resulting small models achieve better performance despite being a fraction of the size.
  • Overall, the mechanism reduces both the model size and the amount of data needed to outperform LLMs.
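
The sketch below illustrates the rationale-extraction step referenced in the list above: the LLM is few-shot prompted in a chain-of-thought style so that each unlabeled input yields both a rationale and a label. The exemplar text and the call_llm helper are hypothetical placeholders, not the paper's actual prompts or API.

    # Sketch of rationale extraction via few-shot chain-of-thought prompting.
    # `call_llm` is a hypothetical stand-in for whatever LLM API is available;
    # the exemplar is illustrative, not taken from the paper.
    COT_EXEMPLAR = (
        "Q: Can a goldfish drive a car?\n"
        "Rationale: A goldfish has no limbs and cannot operate vehicle controls.\n"
        "A: no\n\n"
    )

    def extract_rationale(question: str, call_llm) -> tuple[str, str]:
        """Return (rationale, label) parsed from the LLM's CoT-style completion."""
        prompt = COT_EXEMPLAR + f"Q: {question}\nRationale:"
        completion = call_llm(prompt)          # e.g. "...reasoning text...\nA: yes"
        rationale, _, answer = completion.partition("\nA:")
        return rationale.strip(), answer.strip()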


Terms Mentioned: large language models, distillation, fine-tuning, multi-task training, rationales, NLP benchmarks, model size, training data

Technologies / Libraries Mentioned: BERT, T5