Attention Is All You Need - Summary

Arxiv URL: https://arxiv.org/abs/1706.03762

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

Summary:

The paper proposes a new network architecture called the Transformer that relies solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The Transformer allows for significantly more parallelization and reaches a new state of the art in translation quality while requiring significantly less time to train. The paper also compares self-attention with recurrent and convolutional layers in terms of per-layer computational complexity, parallelizability, and the path length between long-range dependencies, and describes the architecture of the Transformer in detail.

Key Insights & Learnings:

  • The Transformer is a new network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
  • The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality while requiring significantly less time to train.
  • Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.
  • Most competitive neural sequence transduction models have an encoder-decoder structure. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder.
  • The paper also describes the attention functions used in the Transformer: Scaled Dot-Product Attention, which computes softmax(QK^T / sqrt(d_k)) V, and Multi-Head Attention, which runs several such attention functions in parallel over learned linear projections (see the sketch after this list).

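To make the last two insights concrete, here is a minimal NumPy sketch of Scaled Dot-Product Attention and Multi-Head Attention following the formulas in the paper. The function names, the single-input self-attention setup, the omission of masking, and the toy dimensions are illustrative assumptions, not the paper's reference implementation (which is available in Tensor2Tensor).

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax along the given axis
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(q, k, v):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = q.shape[-1]
        scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
        return softmax(scores) @ v

    def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
        # Project the input into per-head queries, keys and values,
        # apply attention within each head, then concatenate and project back.
        seq_len, d_model = x.shape
        d_head = d_model // num_heads
        q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
        k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
        v = (x @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
        heads = scaled_dot_product_attention(q, k, v)   # (heads, seq, d_head)
        concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
        return concat @ w_o

    # Toy usage with illustrative sizes: 4 tokens, d_model = 8, 2 heads
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8))
    w_q, w_k, w_v, w_o = (rng.standard_normal((8, 8)) for _ in range(4))
    print(multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=2).shape)  # (4, 8)

In the full architecture, each encoder and decoder layer combines such a multi-head attention sub-layer with a point-wise, fully connected feed-forward sub-layer, with residual connections and layer normalization around each.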

Terms Mentioned: Transformer, attention mechanisms, recurrence, convolutions, machine translation, BLEU score, self-attention, sequence modeling, transduction models, encoder-decoder structure, Scaled Dot-Product Attention, Multi-Head Attention

Technologies / Libraries Mentioned: Tensor2Tensor