LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models - Summary
This paper presents a method for compressing the prompts given to large language models (LLMs) in order to accelerate inference and reduce cost. The method combines three components: a budget controller, a token-level iterative compression algorithm, and an instruction-tuning-based method for distribution alignment. Experimental
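
To make the idea of compressing a prompt under a token budget concrete, here is a minimal sketch. It is not the paper's algorithm: it assumes a toy unigram frequency model as a stand-in for a real language model's token probabilities, it runs in a single pass rather than iteratively, and it leaves out the budget controller's allocation logic and the distribution alignment step. The names `token_scores` and `compress` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def token_scores(tokens):
    """Score each token by self-information, -log p(token).

    Assumption: p(token) is estimated from unigram counts within the
    prompt itself. This is a toy stand-in for model-based probabilities.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return [-math.log(counts[t] / total) for t in tokens]


def compress(tokens, budget):
    """Keep at most `budget` tokens, preferring the most informative ones.

    Ties are broken in favor of earlier tokens, and the kept tokens are
    returned in their original order.
    """
    budget = max(budget, 0)
    if budget >= len(tokens):
        return list(tokens)
    scores = token_scores(tokens)
    # Rank positions by descending score, then by ascending position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    keep = sorted(ranked[:budget])
    return [tokens[i] for i in keep]


prompt = "the cat sat on the mat and the dog sat on the rug".split()
compressed = compress(prompt, budget=6)
print(" ".join(compressed))
```

In this sketch the frequent, low-information tokens (such as "the") are dropped first, which illustrates why a compressed prompt can be much shorter while keeping most of its distinguishing content. A realistic version would replace `token_scores` with probabilities from an actual language model.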