Scaling Transformer to 1M tokens and beyond with RMT - Summary

Arxiv URL: https://arxiv.org/abs/2304.11062v1

Authors: Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Summary:

The paper presents a method to extend the context length of BERT, a Transformer-based model for natural language processing, by adding token-based memory storage and segment-level recurrence via the Recurrent Memory Transformer (RMT). The approach enables the model to store and retrieve task-specific information across sequences of up to 2 million tokens, significantly exceeding the largest input size reported for Transformer models. The paper also demonstrates that the approach improves long-term dependency handling in natural language understanding and generation tasks and enables large-scale context processing for memory-intensive applications.
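To make the mechanism concrete, below is a minimal sketch of memory tokens combined with segment-level recurrence in PyTorch. It is not the authors' implementation: a small torch.nn.TransformerEncoder stands in for BERT, and the class name, dimensions, segment length, and number of memory tokens are illustrative assumptions.

    # Minimal sketch of segment-level recurrence with memory tokens, in the spirit
    # of the Recurrent Memory Transformer. torch.nn.TransformerEncoder stands in
    # for BERT; all sizes are illustrative, not taken from the paper.
    import torch
    import torch.nn as nn

    class RecurrentMemorySketch(nn.Module):
        def __init__(self, vocab_size=30522, d_model=256, n_memory=10, segment_len=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            # Learnable memory tokens prepended to every segment.
            self.memory = nn.Parameter(torch.randn(n_memory, d_model) * 0.02)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.segment_len = segment_len
            self.n_memory = n_memory

        def forward(self, input_ids):
            # input_ids: (batch, total_len); total_len may far exceed segment_len.
            batch = input_ids.size(0)
            mem = self.memory.unsqueeze(0).expand(batch, -1, -1)  # initial memory state
            outputs = []
            for start in range(0, input_ids.size(1), self.segment_len):
                seg = self.embed(input_ids[:, start:start + self.segment_len])
                # Concatenate the current memory with the segment and run the encoder.
                hidden = self.encoder(torch.cat([mem, seg], dim=1))
                # The updated memory is carried to the next segment (detaching it
                # is a training-time choice, omitted here).
                mem = hidden[:, :self.n_memory, :]
                outputs.append(hidden[:, self.n_memory:, :])
            return torch.cat(outputs, dim=1), mem

    model = RecurrentMemorySketch()
    ids = torch.randint(0, 30522, (2, 512))  # 4 segments of 128 tokens each
    out, final_mem = model(ids)
    print(out.shape, final_mem.shape)  # (2, 512, 256) and (2, 10, 256)

Each segment is processed with a fixed-size attention window (memory plus segment), and the only thing passed between segments is the small block of memory states, which is what allows very long inputs to be handled one segment at a time.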

Key Insights & Learnings:

  • The method enables BERT to store task-specific information across sequences of up to 2 million tokens, significantly exceeding the largest input size reported for Transformer models.
  • The method keeps the base model's memory consumption at 3.6 GB in the experiments.
  • The method stores and processes both local and global information, and recurrence lets information flow between segments of the input sequence.
  • The method scales linearly for any model size if the segment length is fixed.
  • The method can reduce the required FLOPs by up to 295× for sequences consisting of more than one segment (a rough cost sketch follows this list).
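
The linear-scaling and FLOPs claims can be illustrated with a back-of-the-envelope comparison. The sketch below assumes that self-attention cost grows quadratically in the attended length and ignores every other term, so its ratios will not match the paper's 295× figure, which is computed over full-model FLOPs; segment_len and n_memory are illustrative values.

    # Rough cost comparison: full attention over the whole sequence vs. RMT-style
    # per-segment attention. Costs are in arbitrary units (quadratic attention only).
    def attention_cost(length: int) -> int:
        return length * length

    def full_attention_cost(total_len: int) -> int:
        # One forward pass attending over the entire sequence at once.
        return attention_cost(total_len)

    def rmt_cost(total_len: int, segment_len: int, n_memory: int) -> int:
        # One pass per segment; each segment attends over segment + memory tokens,
        # so the total grows linearly with the number of segments.
        n_segments = -(-total_len // segment_len)  # ceiling division
        return n_segments * attention_cost(segment_len + n_memory)

    for total in (512, 4096, 32768, 2_000_000):
        full = full_attention_cost(total)
        rmt = rmt_cost(total, segment_len=512, n_memory=10)
        print(f"{total:>9} tokens  full/segmented cost ratio ~ {full / rmt:,.0f}x")

Because each segment attends over a window of fixed length, the segmented cost grows linearly with the number of segments, while full attention grows quadratically with total sequence length, which is where the large FLOPs savings for multi-segment inputs come from.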


Terms Mentioned: Transformer, BERT, Recurrent Memory Transformer, natural language processing, memory storage, segment-level recurrence, FLOPs

Technologies / Libraries Mentioned: HuggingFace Transformers