Evaluating Long-Context LLMs
Arxiv URL: https://arxiv.org/abs/2502.05167
Authors: Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
Summary:
This paper proposes a new way to evaluate large language models (LLMs) that claim to handle long contexts effectively. The researchers introduce NoLiMa, a benchmark that extends traditional Needle-in-a-Haystack (NIAH) tests by eliminating literal matches between the question and the relevant information in the context. The model must rely on associative reasoning rather than exact string matching, which makes the task considerably harder.

The researchers evaluated 12 popular LLMs that claim to support contexts of up to 128K tokens. While these models perform well on short contexts, their performance degrades sharply as the context grows: at 32K tokens, most models retain only about half of their short-context performance, and even leading models such as GPT-4o fall from near-perfect scores to much lower accuracy. To probe associative reasoning, the benchmark places "needles" into a "haystack" with minimal lexical overlap between question and needle, so the model must locate the information through latent associative links instead of literal cues.

The findings expose significant weaknesses in LLMs when the context offers no literal matches, limiting their ability to retrieve relevant information buried in large amounts of irrelevant text. The authors argue that this style of evaluation gives a clearer picture of the limits of current long-context understanding. Insights from benchmarks like NoLiMa could improve the reliability and accuracy of LLMs in practical applications such as search engines and retrieval systems, where lexical mismatches with queries are common.
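To make the setup concrete, here is a minimal sketch of a NIAH-style probe in the spirit of this benchmark: a needle is inserted at a chosen depth into filler text, and the question shares no content words with it, so answering requires an associative hop. The needle/question pair, the `ask_llm` interface, and the word-based length budgeting are illustrative assumptions, not the paper's actual materials.

```python
import random

# Illustrative needle/question pair (assumed, not from the paper): the question never
# mentions the Semperoper, so the model must know the Semperoper is in Dresden.
NEEDLE = "Actually, Yuki lives next to the Semperoper."
QUESTION = "Which character has been to Dresden?"
GOLD_ANSWER = "yuki"

def build_context(haystack_paragraphs, depth, target_words):
    """Stack filler paragraphs up to roughly target_words and insert the needle at a relative depth."""
    filler, words = [], 0
    while words < target_words:
        paragraph = random.choice(haystack_paragraphs)
        filler.append(paragraph)
        words += len(paragraph.split())
    cut = int(depth * len(filler))
    return "\n\n".join(filler[:cut] + [NEEDLE] + filler[cut:])

def niah_trial(ask_llm, haystack_paragraphs, depth=0.5, target_words=32_000):
    """Run one trial: build the long context, query the model, score by substring match."""
    prompt = build_context(haystack_paragraphs, depth, target_words) + f"\n\nQuestion: {QUESTION}\nAnswer:"
    return GOLD_ANSWER in ask_llm(prompt).lower()
```

Sweeping `depth` and `target_words` over a grid of placements and context lengths, and averaging the boolean scores per length, yields the kind of length-versus-accuracy curve the paper reports.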
Key Insights & Learnings:
- LLMs perform well in short contexts but struggle as context length increases.
- Traditional benchmarks have limitations due to reliance on literal matches.
- NoLiMa introduces tests without literal overlap, emphasizing latent associative reasoning.
- Most evaluated models show drastic performance drops for long contexts (see the scoring sketch after this list).
- The research highlights the need for better tools to assess and improve LLM reasoning.
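As a companion to the performance-drop point above, a small helper can express each context length's score as a fraction of the short-context baseline and report the longest length that stays above a retention threshold. This is a hedged sketch with placeholder numbers rather than the paper's results; the 50% threshold simply echoes the "half their performance at 32K" observation in the summary.

```python
def retention_report(scores, base_len=1_000, threshold=0.5):
    """scores: accuracy per context length (tokens). Returns retention ratios vs. the
    short-context baseline and the longest length still meeting the threshold."""
    base = scores[base_len]
    ratios = {length: acc / base for length, acc in scores.items()}
    effective = max((l for l, r in ratios.items() if r >= threshold), default=base_len)
    return ratios, effective

# Placeholder accuracies (illustrative, not taken from the paper):
ratios, effective_len = retention_report({1_000: 0.95, 8_000: 0.83, 16_000: 0.61, 32_000: 0.46})
print(ratios, effective_len)  # 32K retains ~0.48 of the 1K score, so it falls below the threshold
```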
Terms Mentioned: Long-context understanding, Needle-in-a-Haystack, Associative reasoning, Lexical overlap, Attention mechanism
Technologies / Libraries Mentioned: Llama, GPT-4o, Contriever