Scaling production AI: Cerebras joins the Portkey ecosystem
Cerebras inference is now available on the Portkey AI Gateway, bringing ultra-fast performance with enterprise-grade governance and control.
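As a rough illustration of what "available on the Portkey AI Gateway" means in practice, the sketch below routes a chat completion to Cerebras through Portkey's Python SDK (pip install portkey-ai). It is a minimal sketch under assumptions: the virtual key name and the model identifier are placeholders, not values from the announcement; substitute whatever is configured in your Portkey workspace.

    # Minimal sketch: send a chat completion to a Cerebras-backed provider
    # via the Portkey AI Gateway. Key names and model id are assumptions.
    from portkey_ai import Portkey

    client = Portkey(
        api_key="PORTKEY_API_KEY",           # your Portkey API key
        virtual_key="cerebras-virtual-key",  # assumed: a virtual key pointing at Cerebras
    )

    response = client.chat.completions.create(
        model="llama3.1-8b",                 # assumed Cerebras-hosted model id
        messages=[{"role": "user",
                   "content": "Summarize the benefits of fast inference."}],
    )
    # The SDK mirrors the OpenAI response shape.
    print(response.choices[0].message.content)

Because the gateway exposes one interface across providers, swapping the virtual key (or a gateway config) is typically all that changes when moving traffic between Cerebras and other backends.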
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models - Summary
This paper presents a method for compressing prompts to large language models (LLMs) in order to accelerate inference and reduce cost. The method combines a budget controller, a token-level iterative compression algorithm, and an instruction-tuning-based approach for aligning the distributions of the small compressor model and the target LLM. Experimental results show the approach achieves up to 20x compression with little loss in downstream task performance.
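To make the core idea of token-level compression concrete, here is a heavily simplified sketch: score each prompt token with a small causal LM and keep only the most "surprising" (high negative log-likelihood) tokens, on the premise that low-perplexity tokens carry little information the LLM cannot infer. This is not the paper's full algorithm (it omits the budget controller, the iterative segment-wise pass, and distribution alignment); the model name (gpt2) and keep ratio are illustrative assumptions.

    # Simplified perplexity-based prompt compression in the spirit of LLMLingua.
    # Not the paper's algorithm: no budget controller, no iterative pass,
    # no distribution alignment. Model name and keep_ratio are assumptions.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def compress_prompt(prompt: str, keep_ratio: float = 0.5,
                        model_name: str = "gpt2") -> str:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        model.eval()

        input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
        with torch.no_grad():
            logits = model(input_ids).logits  # (1, seq_len, vocab)

        # Per-token negative log-likelihood: token i is scored by the model's
        # prediction at position i-1, i.e. how surprising it is given its prefix.
        log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
        targets = input_ids[:, 1:]
        nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]

        # Keep the first token plus the highest-NLL (most informative) tokens.
        n_keep = max(1, int(keep_ratio * nll.numel()))
        top = torch.topk(nll, n_keep).indices + 1  # shift to input positions
        keep = torch.cat([torch.tensor([0]), top]).sort().values
        return tokenizer.decode(input_ids[0, keep], skip_special_tokens=True)

    if __name__ == "__main__":
        text = ("You are a helpful assistant. Please read the following report "
                "carefully and then answer the question at the end.")
        print(compress_prompt(text, keep_ratio=0.4))

The full method additionally allocates different compression budgets to instructions, demonstrations, and questions, compresses iteratively in segments so later decisions condition on earlier ones, and fine-tunes the small model so its token distribution better matches the target LLM's.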