The Power of Scale for Parameter-Efficient Prompt Tuning - Summary
Arxiv URL: https://arxiv.org/abs/2104.08691
Authors: Brian Lester, Rami Al-Rfou, Noah Constant
Summary:
The paper explores prompt tuning, a mechanism for learning soft prompts through backpropagation to condition frozen language models for specific downstream tasks. The approach outperforms GPT-3's few-shot learning and becomes competitive with full model tuning as model size grows. Prompt tuning also confers robustness to domain transfer and enables efficient prompt ensembling.
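The mechanism is simple enough to sketch. Below is a minimal, self-contained PyTorch illustration of the idea, not the paper's T5 implementation: a small stand-in backbone is frozen and only a short sequence of soft-prompt embeddings, prepended to the input embeddings, receives gradients. The class names, dimensions, and hyperparameters are illustrative assumptions.

```python
# A minimal, self-contained PyTorch sketch of prompt tuning (not the paper's
# T5/JAX implementation). The backbone, dimensions, and hyperparameters are
# illustrative assumptions; only the soft prompt receives gradients.
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """Stand-in for a pre-trained language model whose weights stay frozen."""
    def __init__(self, vocab_size=32000, d_model=256, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, inputs_embeds):
        hidden = self.encoder(inputs_embeds)
        return self.head(hidden.mean(dim=1))          # pool over the sequence

class PromptTuner(nn.Module):
    """Trains only `prompt_len` soft-prompt vectors; the backbone is frozen."""
    def __init__(self, backbone, prompt_len=20):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                   # freeze every backbone weight
        # Initialize the soft prompt from sampled vocabulary embeddings, which the
        # paper found works better than uniform random initialization.
        idx = torch.randint(0, backbone.embed.num_embeddings, (prompt_len,))
        self.soft_prompt = nn.Parameter(backbone.embed.weight[idx].detach().clone())

    def forward(self, input_ids):
        tok = self.backbone.embed(input_ids)                           # (B, T, D)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        return self.backbone(torch.cat([prompt, tok], dim=1))          # prepend prompt

backbone = FrozenBackbone()
model = PromptTuner(backbone, prompt_len=20)
# The paper reports a learning rate of 0.3 with Adafactor; plain Adam is used
# here only to keep the sketch short.
optimizer = torch.optim.Adam([model.soft_prompt], lr=0.3)

input_ids = torch.randint(0, 32000, (8, 16))          # dummy batch of token ids
labels = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(input_ids), labels)
loss.backward()                                       # only the prompt accumulates gradients
optimizer.step()
```

At inference time the same frozen backbone can serve many tasks, since each task contributes only its own small prompt tensor.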
Key Insights & Learnings:
- Prompt tuning outperforms GPT-3's few-shot learning by a large margin.
- Prompt tuning becomes more competitive with full model tuning as scale increases, matching its quality at the largest model sizes studied.
- Prompt tuning confers benefits in robustness to domain transfer.
- Prompt tuning enables efficient prompt ensembling, since many prompts can share a single frozen model (see the sketch after this list).
- Prompt tuning is a simplification of the recently proposed prefix tuning, adding tunable tokens only at the input layer rather than at every layer, yet it is sufficient to be competitive with model tuning.
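Prompt ensembling and single-batch multi-prompt inference follow from the same mechanism. The sketch below is an illustrative toy, not the paper's setup, and reuses the assumed FrozenBackbone / PromptTuner classes from the earlier sketch: several prompts for one task are stacked into a single batch over the shared frozen model, and their predictions are combined by majority vote.

```python
# A minimal sketch of prompt ensembling and mixed-prompt batching, reusing the
# FrozenBackbone / PromptTuner classes from the sketch above (illustrative only).
import torch

n_prompts = 5
backbone = FrozenBackbone()
# In practice each member would be a prompt trained on the same task from a
# different random seed; here they are freshly initialized for illustration.
ensemble = [PromptTuner(backbone, prompt_len=20) for _ in range(n_prompts)]

input_ids = torch.randint(0, 32000, (1, 16))                         # a single example
with torch.no_grad():
    prompts = torch.stack([m.soft_prompt for m in ensemble])         # (N, P, D)
    tok = backbone.embed(input_ids).expand(n_prompts, -1, -1)        # replicate example
    logits = backbone(torch.cat([prompts, tok], dim=1))              # one batch, one pass
    prediction = logits.argmax(dim=-1).mode().values                 # majority vote
```

Because all members share one frozen backbone, the ensemble costs a single model plus N small prompts, and one forward pass over a batch replaces N separate model runs; the same batching trick allows prompts for different tasks to be mixed in one inference batch.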
Advantages:
- Parameter Efficiency: Prompt tuning trains less than 0.01% of the model's parameters for task-specific adaptation (for models with billions of parameters) while maintaining competitive performance; a rough count is worked out after this list.
- Storage Efficiency: Eliminates the need to store separate copies of the model for each task, as only small task-specific prompts need to be stored.
- Improved Domain Transfer: Prompt tuning shows better robustness to domain shifts compared to full model fine-tuning, particularly in tasks with significant domain differences.
- Efficient Ensembling: Enables "prompt ensembling", which provides performance benefits similar to traditional model ensembling while sharing a single frozen model instead of storing and serving several full copies.
- Inference Efficiency: Allows multiple tasks to be processed in a single batch during inference, improving computational efficiency.
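To make the "less than 0.01%" figure concrete, here is a rough back-of-the-envelope count, assuming T5 1.1 XXL-like dimensions (roughly 4096-dimensional embeddings and about 11B total parameters) and the longest prompt length studied in the paper; the exact numbers are illustrative.

```python
# Back-of-the-envelope count, assuming T5 1.1 XXL-like dimensions (approximate).
d_model = 4096                 # embedding width assumed for the XXL model
prompt_len = 100               # longest prompt length studied in the paper
total_params = 11_000_000_000  # roughly 11B frozen parameters

prompt_params = prompt_len * d_model   # the soft prompt is the only trainable tensor
print(f"{prompt_params:,} trainable parameters "
      f"= {100 * prompt_params / total_params:.4f}% of the model")
# -> 409,600 trainable parameters = 0.0037% of the model
```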
Limitations:
- Model Size Dependency: Prompt tuning's performance is heavily dependent on model size - smaller models may not achieve competitive results compared to traditional fine-tuning.
- Pre-training Sensitivity: Effectiveness is influenced by the pre-training objective - models pre-trained with span corruption show worse performance than those with language modeling objectives.
- Limited Interpretability: While prompts develop word-like representations, the complete sequences of prompt tokens typically lack clear interpretability.
- Initialization Sensitivity: Performance can be sensitive to prompt initialization strategies, particularly for smaller models.
- Length Requirements: Most models require longer prompts (20+ tokens) to achieve good performance, though this becomes less critical with larger models.
Terms Mentioned: prompt tuning, soft prompts, downstream tasks, backpropagation, model tuning, pre-trained models, ELMo, GPT, BERT, priming, SuperGLUE, prefix tuning, masked language model
Technologies / Libraries Mentioned: T5