The Power of Scale for Parameter-Efficient Prompt Tuning - Summary

Arxiv URL: https://arxiv.org/abs/2104.08691

Authors: Brian Lester, Rami Al-Rfou, Noah Constant

Summary:

The paper explores prompt tuning, a mechanism for learning soft prompts through backpropagation to condition frozen language models to perform specific downstream tasks. The approach outperforms GPT-3's few-shot learning by a large margin and closes the gap with full model tuning as model size grows. Prompt tuning also confers benefits in robustness to domain transfer and enables efficient prompt ensembling.
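
To make the mechanism concrete, here is a minimal sketch of soft prompt tuning in PyTorch with Hugging Face Transformers. The checkpoint name, optimizer, learning rate, and example text are illustrative assumptions rather than the paper's exact setup (the authors tune LM-adapted T5 1.1 checkpoints with Adafactor at a learning rate of 0.3); the key point is that only the prompt parameters receive gradients while the model stays frozen.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "t5-small"       # assumption: any T5 checkpoint; the paper uses LM-adapted T5 1.1
num_prompt_tokens = 20        # the paper finds ~20 tokens is usually sufficient

tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Freeze the entire pre-trained model; only the soft prompt below is trained.
for param in model.parameters():
    param.requires_grad = False

# Soft prompt: a (num_prompt_tokens, embed_dim) matrix of free parameters,
# initialized here from sampled vocabulary embeddings (one of the paper's options).
vocab_embeddings = model.get_input_embeddings().weight.detach()
init_ids = torch.randint(0, vocab_embeddings.size(0), (num_prompt_tokens,))
soft_prompt = torch.nn.Parameter(vocab_embeddings[init_ids].clone())

def forward_with_prompt(prompt, input_ids, attention_mask, labels):
    """Prepend a soft prompt to the input embeddings and run the frozen model."""
    input_embeds = model.get_input_embeddings()(input_ids)            # (B, T, D)
    batch_size = input_embeds.size(0)
    prompt_embeds = prompt.unsqueeze(0).expand(batch_size, -1, -1)    # (B, P, D)
    prompt_mask = torch.ones(batch_size, prompt.size(0),
                             dtype=attention_mask.dtype, device=attention_mask.device)
    return model(inputs_embeds=torch.cat([prompt_embeds, input_embeds], dim=1),
                 attention_mask=torch.cat([prompt_mask, attention_mask], dim=1),
                 labels=labels)

# Only the soft prompt reaches the optimizer (plain Adam with a generic rate is a
# simplification; the paper uses Adafactor with a higher learning rate).
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

# One illustrative training step on a SuperGLUE-style text-to-text example.
batch = tokenizer(["boolq question: is the sky blue? passage: ..."],
                  return_tensors="pt", padding=True)
labels = tokenizer(["True"], return_tensors="pt").input_ids
loss = forward_with_prompt(soft_prompt, batch.input_ids, batch.attention_mask, labels).loss
loss.backward()
optimizer.step()
```

Because the frozen model is shared across tasks, each new task only adds a small (num_prompt_tokens x embedding_dim) prompt matrix, which is the source of the storage savings discussed below.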

Key Insights & Learnings:

  • Prompt tuning outperforms GPT-3's few-shot learning by a large margin.
  • Prompt tuning becomes more competitive with scale, matching full model tuning once models reach billions of parameters.
  • Prompt tuning confers benefits in robustness to domain transfer.
  • Prompt tuning enables efficient prompt ensembling.
  • Prompt tuning can be seen as a simplification of the recently proposed prefix tuning, and the learned prompt alone is sufficient to be competitive with model tuning.

Advantages:

  1. Parameter Efficiency: Prompt tuning requires less than 0.01% of the model's parameters to be trained for task-specific adaptation (for models over a billion parameters) while maintaining competitive performance.
  2. Storage Efficiency: Eliminates the need to store separate copies of the model for each task, as only small task-specific prompts need to be stored.
  3. Improved Domain Transfer: Prompt tuning shows better robustness to domain shifts compared to full model fine-tuning, particularly in tasks with significant domain differences.
  4. Efficient Ensembling: Enables "prompt ensembling", in which several prompts trained for the same task are combined at inference; this provides benefits similar to traditional model ensembling at much lower computational and storage cost (see the sketch after this list).
  5. Inference Efficiency: Allows multiple tasks to be processed in a single batch during inference, improving computational efficiency.
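
As a rough illustration of points 4 and 5, the sketch below reuses the forward_with_prompt helper, model, and tokenizer from the earlier sketch. It assumes several soft prompts have been trained independently for the same task (e.g., from different initializations); each prompt votes on the label, and a single frozen model serves every ensemble member. For clarity it loops over prompts, whereas the paper replicates each example across one batch with a different prompt per copy so the frozen model runs only once.

```python
import torch
from collections import Counter

@torch.no_grad()
def ensemble_classify(text, candidate_labels, soft_prompts):
    """Majority-vote prediction over an ensemble of soft prompts (illustrative)."""
    enc = tokenizer(text, return_tensors="pt")
    votes = []
    for prompt in soft_prompts:
        # Score each candidate label by the negative loss of generating it.
        scores = []
        for label in candidate_labels:
            label_ids = tokenizer(label, return_tensors="pt").input_ids
            out = forward_with_prompt(prompt, enc.input_ids, enc.attention_mask, label_ids)
            scores.append(-out.loss.item())
        votes.append(candidate_labels[max(range(len(scores)), key=scores.__getitem__)])
    # One frozen model, many prompts: only the prompts differ across ensemble members.
    return Counter(votes).most_common(1)[0][0]

# Hypothetical usage with three independently trained prompts:
# prediction = ensemble_classify("boolq question: ... passage: ...",
#                                ["True", "False"],
#                                [prompt_a, prompt_b, prompt_c])
```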

Limitations:

  1. Model Size Dependency: Prompt tuning's performance is heavily dependent on model size - smaller models may not achieve competitive results compared to traditional fine-tuning.
  2. Pre-training Sensitivity: Effectiveness depends on the pre-training objective: models pre-trained only with T5's span-corruption objective underperform unless they are further adapted with a language modeling objective (LM adaptation).
  3. Limited Interpretability: While prompts develop word-like representations, the complete sequences of prompt tokens typically lack clear interpretability.
  4. Initialization Sensitivity: Performance can be sensitive to the prompt initialization strategy, particularly for smaller models; initializing from embeddings of the class labels tends to work best (a sketch follows this list).
  5. Length Requirements: Most models require longer prompts (20+ tokens) to achieve good performance, though this becomes less critical with larger models.
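
On initialization (limitation 4), the paper compares random initialization, sampling embeddings of common vocabulary items, and using the embeddings of the downstream class labels, with class-label initialization generally the strongest choice at smaller model sizes. Below is a minimal sketch of class-label initialization, reusing the model and tokenizer from the first sketch; the label strings and helper name are illustrative.

```python
import torch

def init_prompt_from_labels(model, tokenizer, labels, num_prompt_tokens):
    """Build a soft prompt whose rows start as embeddings of the class-label tokens."""
    embeddings = model.get_input_embeddings().weight.detach()
    rows = []
    for label in labels:
        for token_id in tokenizer(label, add_special_tokens=False).input_ids:
            rows.append(embeddings[token_id])
    # Pad with randomly sampled vocabulary embeddings if the labels are too short.
    while len(rows) < num_prompt_tokens:
        rows.append(embeddings[torch.randint(0, embeddings.size(0), (1,)).item()])
    return torch.nn.Parameter(torch.stack(rows[:num_prompt_tokens]).clone())

# Hypothetical usage for a binary sentiment task:
soft_prompt = init_prompt_from_labels(model, tokenizer, ["positive", "negative"], 20)
```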


Terms Mentioned: prompt tuning, soft prompts, downstream tasks, backpropagation, model tuning, pre-trained models, ELMo, GPT, BERT, priming, SuperGLUE, prefix tuning, masked language model

Technologies / Libraries Mentioned: T5