Arxiv URL: https://arxiv.org/abs/2304.11062v1
Authors: Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev
The paper presents a method to extend the effective context length of BERT, a Transformer-based natural language processing model, by combining token-based memory storage with segment-level recurrence in the Recurrent Memory Transformer (RMT). The approach enables the model to store task-specific information across sequences of up to 2 million tokens, far exceeding the largest input size previously reported for transformer models. The paper also demonstrates that the approach improves long-term dependency handling in natural language understanding and generation tasks and enables large-scale context processing for memory-intensive applications.
Key Insights & Learnings:
- The method enables BERT to store task-specific information across up to 2 million tokens, significantly exceeding the largest input size reported for transformer models.
- The base model's memory footprint remains fixed at 3.6 GB throughout the experiments.
- The method stores and processes both local and global information, with recurrence enabling information flow between segments of the input sequence.
- With the segment length fixed, computation scales linearly with input length for any model size.
- The method can reduce the number of FLOPs by up to 295× for sequences with more than one segment.
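The linear-scaling and FLOPs-reduction insights above follow from a simple cost model: full attention costs grow quadratically in sequence length, while segment-level recurrence pays a fixed per-segment cost. The sketch below uses an illustrative per-layer FLOPs formula with made-up constants and hypothetical segment/memory sizes (not the paper's exact accounting), so the ratios are order-of-magnitude only:

```python
def layer_flops(L, d):
    # Rough transformer-layer cost: O(L*d^2) linear sublayers + O(L^2*d)
    # attention. Constants are illustrative, not the paper's accounting.
    return 24 * L * d**2 + 4 * L**2 * d

def full_attention_flops(L, d=768):
    # Non-recurrent baseline: attend over the entire sequence at once.
    return layer_flops(L, d)

def rmt_flops(L, d=768, seg=512, mem=10):
    # RMT-style cost: N segments, each processed at length seg + mem.
    n_segments = -(-L // seg)  # ceiling division
    return n_segments * layer_flops(seg + mem, d)

for L in (512, 4096, 65536, 2_048_000):
    ratio = full_attention_flops(L) / rmt_flops(L)
    print(f"L={L:>9,}  speedup ratio ~ {ratio:.1f}x")
```

For a single segment the memory tokens add slight overhead (ratio just below 1), but the ratio grows roughly linearly with the number of segments, which is why multi-segment sequences see large FLOPs reductions.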
Terms Mentioned: Transformer, BERT, Recurrent Memory Transformer, natural language processing, memory storage, segment-level recurrence, FLOPs
Technologies / Libraries Mentioned: HuggingFace Transformers