LoRA: Low-Rank Adaptation of Large Language Models - Summary
Arxiv URL: https://arxiv.org/abs/2106.09685
Authors: Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
Summary:
The paper proposes Low-Rank Adaptation (LoRA) as an approach to reduce the number of trainable parameters needed to adapt large pre-trained models to downstream tasks in natural language processing. LoRA freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters per task. LoRA performs on par with or better than full fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapter-based methods, no additional inference latency.
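To make the idea concrete, below is a minimal PyTorch sketch of a LoRA-adapted linear layer, computing h = W0·x + (alpha/r)·B·A·x with W0 frozen and only the low-rank factors A and B trainable. This is an illustrative reimplementation, not the authors' loralib code; the class name, rank r, and scaling alpha are assumed defaults chosen for the example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: h = W0 x + (alpha / r) * B A x."""

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pre-trained weight W0; in practice copied from the original
        # model and never updated during adaptation (random here for the sketch).
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # Trainable rank-decomposition factors: A is (r x in), B is (out x r).
        # B starts at zero so B @ A = 0 and training begins from the
        # pre-trained model's behavior.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        frozen = x @ self.weight.T                      # original frozen path
        update = (x @ self.lora_A.T) @ self.lora_B.T    # low-rank update path
        return frozen + self.scaling * update

# Usage: only lora_A and lora_B carry gradients, so optimizer state is tiny.
layer = LoRALinear(768, 768, r=4)
y = layer(torch.randn(2, 768))
```

Because only A and B (2 x r x d parameters per adapted weight matrix) are trained, the optimizer only tracks gradients and moments for those factors, which is the source of the memory savings quantified in the insights below.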
Key Insights & Learnings:
- LoRA reduces the number of trainable parameters for downstream tasks in natural language processing by injecting trainable rank decomposition matrices into each layer of the Transformer architecture.
- LoRA performs on par with or better than full fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3.
- LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times compared to GPT-3 175B fine-tuned with Adam.
- LoRA allows a single pre-trained model to be shared across tasks, with many small LoRA modules built for different tasks, reducing the storage requirement and task-switching overhead (see the sketch after this list).
- LoRA makes training more efficient and lowers the hardware barrier to entry by up to 3 times when using adaptive optimizers since we do not need to calculate the gradients or maintain the optimizer states for most parameters.
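The "no additional inference latency" and cheap task-switching claims follow from the fact that the low-rank update B·A can be merged into the frozen weight for deployment and subtracted back out when swapping tasks. The sketch below illustrates that bookkeeping; the function names merge_lora and swap_task are hypothetical, not the authors' loralib API.

```python
import torch

def merge_lora(w0, lora_A, lora_B, scaling):
    # Fold the low-rank update into the frozen weight: W = W0 + scaling * (B @ A).
    # Inference then uses a single dense matmul, so no extra latency is added.
    return w0 + scaling * (lora_B @ lora_A)

def swap_task(w_merged, old_A, old_B, new_A, new_B, scaling):
    # Switch tasks by subtracting one LoRA module's update and adding another's,
    # without storing a full copy of the pre-trained weights per task.
    return w_merged - scaling * (old_B @ old_A) + scaling * (new_B @ new_A)
```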
Terms Mentioned: Low-Rank Adaptation, Transformer architecture, fine-tuning, RoBERTa, DeBERTa, GPT-2, GPT-3, rank decomposition matrices, inference latency, language modeling
Technologies / Libraries Mentioned: PyTorch