Attention Is All You Need - Summary
Arxiv URL: https://arxiv.org/abs/1706.03762
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Summary:
The paper proposes a new network architecture called the Transformer that relies solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality while requiring significantly less time to train. The paper also discusses the advantages of self-attention over recurrent and convolutional layers and describes the architecture of the Transformer in detail.
Key Insights & Learnings:
- The Transformer is a new network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
- The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality while requiring significantly less time to train.
- Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.
- Most competitive neural sequence transduction models have an encoder-decoder structure. The Transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder (a sketch of one such encoder layer follows this list).
- The paper also defines the attention functions used in the Transformer: Scaled Dot-Product Attention and Multi-Head Attention (both sketched in the code below).
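To make the last two points concrete, here is a minimal NumPy sketch of Scaled Dot-Product Attention, Multi-Head Attention, and a single encoder layer with its point-wise feed-forward sublayer. The helper names, weight shapes, and the toy sizes in the usage example are illustrative assumptions, not the paper's reference implementation (that lives in the Tensor2Tensor library).

```python
# Minimal sketch of the Transformer's attention functions and one encoder layer.
# Helper names, weight shapes, and test sizes are illustrative assumptions.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)            # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)      # (..., len_q, len_k)
    return softmax(scores, axis=-1) @ V                 # (..., len_q, d_v)


def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, num_heads):
    """Project into num_heads subspaces, attend in each, concatenate, project."""
    def split_heads(x):
        seq_len, d_model = x.shape
        return x.reshape(seq_len, num_heads, d_model // num_heads).transpose(1, 0, 2)

    heads = scaled_dot_product_attention(
        split_heads(Q @ W_q), split_heads(K @ W_k), split_heads(V @ W_v)
    )                                                    # (heads, len_q, d_head)
    concat = heads.transpose(1, 0, 2).reshape(Q.shape[0], -1)
    return concat @ W_o


def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2


def encoder_layer(x, attn_weights, ffn_weights, num_heads):
    """One encoder layer: self-attention and a point-wise feed-forward network,
    each followed by a residual connection and layer normalization."""
    x = layer_norm(x + multi_head_attention(x, x, x, *attn_weights, num_heads=num_heads))
    return layer_norm(x + position_wise_ffn(x, *ffn_weights))


# Tiny usage example with made-up sizes: 4 positions, d_model=8, d_ff=16, 2 heads.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
attn_weights = [rng.normal(size=(8, 8)) for _ in range(4)]       # W_q, W_k, W_v, W_o
ffn_weights = [rng.normal(size=(8, 16)), np.zeros(16),
               rng.normal(size=(16, 8)), np.zeros(8)]             # W1, b1, W2, b2
out = encoder_layer(x, attn_weights, ffn_weights, num_heads=2)
print(out.shape)                                                   # (4, 8)
```

The 1/sqrt(d_k) scaling keeps the dot products from growing large with dimension, which would otherwise push the softmax into regions with very small gradients; the multi-head design lets the model attend to information from different representation subspaces in parallel.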
Terms Mentioned: Transformer, attention mechanisms, recurrence, convolutions, machine translation, BLEU score, self-attention, sequence modeling, transduction models, encoder-decoder structure, Scaled Dot-Product Attention, Multi-Head Attention
Technologies / Libraries Mentioned: Tensor2Tensor