Attention Is All You Need - Summary
Arxiv URL: https://arxiv.org/abs/1706.03762
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Summary:
The paper proposes a new network architecture called the Transformer that relies solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality while requiring significantly less time to train. The paper also discusses the advantages of self-attention over recurrent and convolutional layers and describes the architecture of the Transformer in detail.
Key Insights & Learnings:
- The Transformer is a new network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
- The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality while requiring significantly less time to train.
- Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.
- Most competitive neural sequence transduction models have an encoder-decoder structure. The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder (a sketch of one such encoder layer follows this list).
- The paper also defines the attention functions used in the Transformer: Scaled Dot-Product Attention and Multi-Head Attention (both sketched in the code below).
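To make the last two points concrete, here is a minimal NumPy sketch of Scaled Dot-Product Attention, Multi-Head Attention, and a single encoder layer with its point-wise feed-forward sublayer. The helper names, weight shapes, and the toy sizes in the usage example are illustrative assumptions, not the paper's reference implementation (that lives in the Tensor2Tensor library).

```python
# Minimal sketch of the Transformer's attention functions and one encoder layer.
# Helper names, weight shapes, and test sizes are illustrative assumptions.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)            # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)      # (..., len_q, len_k)
    return softmax(scores, axis=-1) @ V                 # (..., len_q, d_v)


def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, num_heads):
    """Project into num_heads subspaces, attend in each, concatenate, project."""
    def split_heads(x):
        seq_len, d_model = x.shape
        return x.reshape(seq_len, num_heads, d_model // num_heads).transpose(1, 0, 2)

    heads = scaled_dot_product_attention(
        split_heads(Q @ W_q), split_heads(K @ W_k), split_heads(V @ W_v)
    )                                                    # (heads, len_q, d_head)
    concat = heads.transpose(1, 0, 2).reshape(Q.shape[0], -1)
    return concat @ W_o


def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


def encoder_layer(x, attn_weights, ffn_weights, num_heads):
    """One encoder layer: self-attention and a point-wise feed-forward network,
    each followed by a residual connection and layer normalization."""
    x = layer_norm(x + multi_head_attention(x, x, x, *attn_weights, num_heads=num_heads))
    return layer_norm(x + position_wise_ffn(x, *ffn_weights))


# Tiny usage example with made-up sizes: 4 positions, d_model=8, d_ff=16, 2 heads.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
attn_weights = [rng.normal(size=(8, 8)) for _ in range(4)]       # W_q, W_k, W_v, W_o
ffn_weights = [rng.normal(size=(8, 16)), np.zeros(16),
               rng.normal(size=(16, 8)), np.zeros(8)]             # W1, b1, W2, b2
out = encoder_layer(x, attn_weights, ffn_weights, num_heads=2)
print(out.shape)                                                   # (4, 8)
```

The 1/sqrt(d_k) scaling keeps the dot products from growing large with dimension, which would otherwise push the softmax into regions with very small gradients; the multi-head design lets the model attend to information from different representation subspaces in parallel.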
Terms Mentioned: Transformer, attention mechanisms, recurrence, convolutions, machine translation, BLEU score, self-attention, sequence modeling, transduction models, encoder-decoder structure, Scaled Dot-Product Attention, Multi-Head Attention
Technologies / Libraries Mentioned: Tensor2Tensor