Scaling production AI: Cerebras joins the Portkey ecosystem
Cerebras inference is now available on the Portkey AI Gateway, bringing ultra-fast performance with enterprise-grade governance and control.
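As a rough illustration of what "available on the Portkey AI Gateway" means in practice, the sketch below routes a chat completion to Cerebras through Portkey's Python SDK (pip install portkey-ai). It is a minimal sketch under assumptions: the virtual key name and the model identifier are placeholders, not values from the announcement; substitute whatever is configured in your Portkey workspace.

    # Minimal sketch: send a chat completion to a Cerebras-backed provider
    # via the Portkey AI Gateway. Key names and model id are assumptions.
    from portkey_ai import Portkey

    client = Portkey(
        api_key="PORTKEY_API_KEY",           # your Portkey API key
        virtual_key="cerebras-virtual-key",  # assumed: a virtual key pointing at Cerebras
    )

    response = client.chat.completions.create(
        model="llama3.1-8b",                 # assumed Cerebras-hosted model id
        messages=[{"role": "user",
                   "content": "Summarize the benefits of fast inference."}],
    )
    # The SDK mirrors the OpenAI response shape.
    print(response.choices[0].message.content)

Because the gateway exposes one interface across providers, swapping the virtual key (or a gateway config) is typically all that changes when moving traffic between Cerebras and other backends.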
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models - Summary
This paper presents a method for compressing prompts to large language models (LLMs) in order to accelerate inference and reduce cost. The method combines a budget controller, a token-level iterative compression algorithm, and an instruction-tuning-based approach for aligning the distributions of the small compressor model and the target LLM. Experimental results show the approach achieves up to 20x compression with little loss in downstream task performance.
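To make the core idea of token-level compression concrete, here is a heavily simplified sketch: score each prompt token with a small causal LM and keep only the most "surprising" (high negative log-likelihood) tokens, on the premise that low-perplexity tokens carry little information the LLM cannot infer. This is not the paper's full algorithm (it omits the budget controller, the iterative segment-wise pass, and distribution alignment); the model name (gpt2) and keep ratio are illustrative assumptions.

    # Simplified perplexity-based prompt compression in the spirit of LLMLingua.
    # Not the paper's algorithm: no budget controller, no iterative pass,
    # no distribution alignment. Model name and keep_ratio are assumptions.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def compress_prompt(prompt: str, keep_ratio: float = 0.5,
                        model_name: str = "gpt2") -> str:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        model.eval()

        input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
        with torch.no_grad():
            logits = model(input_ids).logits  # (1, seq_len, vocab)

        # Per-token negative log-likelihood: token i is scored by the model's
        # prediction at position i-1, i.e. how surprising it is given its prefix.
        log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
        targets = input_ids[:, 1:]
        nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]

        # Keep the first token plus the highest-NLL (most informative) tokens.
        n_keep = max(1, int(keep_ratio * nll.numel()))
        top = torch.topk(nll, n_keep).indices + 1  # shift to input positions
        keep = torch.cat([torch.tensor([0]), top]).sort().values
        return tokenizer.decode(input_ids[0, keep], skip_special_tokens=True)

    if __name__ == "__main__":
        text = ("You are a helpful assistant. Please read the following report "
                "carefully and then answer the question at the end.")
        print(compress_prompt(text, keep_ratio=0.4))

The full method additionally allocates different compression budgets to instructions, demonstrations, and questions, compresses iteratively in segments so later decisions condition on earlier ones, and fine-tunes the small model so its token distribution better matches the target LLM's.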