LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models - Summary

Arxiv URL: https://arxiv.org/abs/2310.05736

Authors: Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu

Summary:

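This paper presents LLMLingua, a coarse-to-fine method for compressing the prompts given to large language models (LLMs) in order to accelerate inference and reduce cost. It combines three components: a budget controller that maintains semantic integrity under high compression ratios, a token-level iterative compression algorithm that models the interdependence between compressed contents, and an instruction-tuning-based method for distribution alignment between language models. Experiments on four datasets show that the approach achieves state-of-the-art performance, with up to 20x compression and little performance loss.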
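To make the idea of token-level compression concrete, here is a minimal sketch of segment-wise token pruning by surprisal. It is not the authors' implementation: the function names, the `segment_size` parameter, and the add-one-smoothed unigram model are all assumptions made for illustration. The unigram model stands in for the small language model a real system would use, and the sketch does not condition each segment on previously compressed segments.

```python
import math
from collections import Counter


def surprisal_scores(tokens, background):
    """Toy self-information: -log p(token) under an add-one-smoothed
    unigram model estimated from `background` (a stand-in for a small LM)."""
    counts = Counter(background)
    total = len(background)
    vocab = len(counts) + 1  # one extra bucket for unseen tokens
    return [-math.log((counts[t] + 1) / (total + vocab)) for t in tokens]


def compress(tokens, background, keep_ratio, segment_size=8):
    """Keep roughly `keep_ratio` of the tokens in each segment,
    preferring the most surprising (least predictable) tokens."""
    kept = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        scores = surprisal_scores(segment, background)
        n_keep = max(1, round(len(segment) * keep_ratio))
        # Pick the n_keep highest-surprisal positions, then restore order.
        ranked = sorted(range(len(segment)), key=lambda i: -scores[i])
        for i in sorted(ranked[:n_keep]):
            kept.append(segment[i])
    return kept


# Hypothetical usage: common function words are dropped first.
background = ("the and on a of " * 20).split()
prompt = "the cat sat on the mat and dog".split()
print(compress(prompt, background, keep_ratio=0.5, segment_size=4))
# → ['cat', 'sat', 'mat', 'dog']
```

Tokens that the background model finds predictable carry little information and are removed first, which is the intuition behind perplexity-based pruning. A real compressor would replace `surprisal_scores` with conditional token probabilities from a language model.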

Key Insights & Learnings:

  • LLMLingua compresses prompts before they reach the LLM, which shortens inference time and lowers cost.
  • The method has three parts: a budget controller that allocates compression ratios, a token-level iterative compression algorithm, and an instruction-tuning-based method for distribution alignment.
  • In the reported experiments, the approach achieves state-of-the-art performance.
  • It allows up to 20x compression with little performance loss.
  • The method is validated on four datasets from different domains (GSM8K, BBH, ShareGPT, and Arxiv-March23), showing its effectiveness across various scenarios.
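The budget controller's job of allocating compression ratios can be illustrated with a little arithmetic. The sketch below assumes the prompt is split into an instruction, a block of demonstrations, and a question, and that the instruction and question are kept at fixed, high rates while the demonstrations absorb the rest of the compression. The split, the function name, and the default keep rates are hypothetical choices for illustration, not values taken from the paper.

```python
def allocate_demo_keep_ratio(n_instruction, n_demos, n_question,
                             target_keep_ratio,
                             instruction_keep=0.85, question_keep=0.9):
    """Return the fraction of demonstration tokens to keep so that the
    whole prompt meets `target_keep_ratio` (kept tokens / original tokens).

    The default keep rates for the instruction and question are
    hypothetical; they only express "compress these parts lightly".
    """
    if n_demos <= 0:
        raise ValueError("n_demos must be positive")
    total = n_instruction + n_demos + n_question
    budget = target_keep_ratio * total  # tokens allowed overall
    reserved = instruction_keep * n_instruction + question_keep * n_question
    demo_budget = budget - reserved     # tokens left for demonstrations
    # Clamp to [0, 1]: a negative budget means the target cannot be met
    # without also compressing the instruction and question harder.
    return min(1.0, max(0.0, demo_budget / n_demos))


# Hypothetical prompt: 100 instruction, 1800 demonstration, 100 question tokens,
# compressed 5x overall (keep ratio 0.2).
ratio = allocate_demo_keep_ratio(100, 1800, 100, target_keep_ratio=0.2)
print(f"keep {ratio:.3f} of the demonstration tokens")
```

With these numbers the overall budget is 400 tokens, 175 of which are reserved for the instruction and question, so the demonstrations are cut to 225 of 1800 tokens (a keep ratio of 0.125). The same arithmetic shows why very high ratios such as 20x leave almost no room for demonstrations unless every part of the prompt is compressed.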


Terms Mentioned: large language models, LLMs, prompt compression, model inference, budget controller, token-level iterative compression, instruction tuning, distribution alignment, state-of-the-art performance, compression ratios, experimental results, datasets, domains

Technologies / Libraries Mentioned: Microsoft Corporation