This paper presents a method for compressing prompts for large language models (LLMs) to accelerate model inference and reduce cost. The method combines a budget controller, a token-level iterative compression algorithm, and an instruction-tuning-based method for distribution alignment. Experimental
This paper introduces the Skeleton-of-Thought (SoT) method to decrease the generation latency of large language models (LLMs). SoT guides LLMs to first generate the skeleton of the answer and then conducts parallel API calls or batched decoding to complete the contents of each skeleton point. The m
This paper provides a comprehensive overview of the current state of knowledge on reasoning in Large Language Models (LLMs), including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous r
The paper proposes a prompt ensembling method for large language models called 'boosted prompting', which uses a small dataset to construct a set of few-shot prompts that together comprise a boosted prompt ensemble. The few-shot examples for each prompt are chosen in a stepwise fashion to be 'hard'
The paper discusses the cost associated with querying large language models (LLMs) and proposes FrugalGPT, a framework that uses LLM APIs to process natural language queries within a budget constraint. The framework uses prompt adaptation, LLM approximation, and LLM cascade to reduce the inference
The paper introduces a new mechanism called Distilling step-by-step that trains smaller models to outperform large language models (LLMs) while using less training data. The mechanism extracts LLM rationales as additional supervision for small models within a multi-task tra
The paper proposes a multi-modal AI system named AudioGPT that complements Large Language Models (LLMs) with foundation models to process complex audio information and solve numerous understanding and generation tasks. AudioGPT is connected with an input/output interface (ASR, TTS) to support spoke
The paper discusses the limitations of large language models (LMs) and proposes a neuro-symbolic architecture called the Modular Reasoning, Knowledge and Language (MRKL) system that combines LMs with external knowledge sources and discrete reasoning modules to overcome these limitations.
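The routing idea in the MRKL summary above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `calculator`, `language_model`, and `route` are hypothetical, the regex-based dispatch stands in for whatever router a real system would use, and the language model is a placeholder that makes no API call. The sketch only shows a query being sent either to a discrete reasoning module or to the LM.

```python
import ast
import operator
import re

# Hypothetical discrete reasoning module: a small, safe arithmetic evaluator.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, / over numeric literals without using eval()."""

    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")

    return str(ev(ast.parse(expression.strip(), mode="eval").body))


def language_model(query: str) -> str:
    """Placeholder for an LM call (assumption: no real model is queried)."""
    return f"[LM answer to: {query}]"


# Assumption: a query made only of digits, operators, and parentheses is arithmetic.
_ARITHMETIC = re.compile(r"^[\d\s.+\-*/()]+$")


def route(query: str) -> str:
    """Send arithmetic to the calculator module and everything else to the LM."""
    if _ARITHMETIC.match(query) and any(ch.isdigit() for ch in query):
        return calculator(query)
    return language_model(query)


if __name__ == "__main__":
    print(route("12 * (3 + 4)"))
    print(route("Who wrote Hamlet?"))
```

The design point is that exact computation is delegated to a module that cannot make arithmetic mistakes, while open-ended queries fall through to the language model.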
The paper introduces ReAct, a novel prompt-based paradigm that synergizes reasoning and acting in language models for general task solving. ReAct generates both verbal reasoning traces and actions in an interleaved manner, allowing the model to perform dynamic reasoning to create, maintain, and adj
The paper introduces Chameleon, a plug-and-play compositional reasoning framework that augments large language models (LLMs) to address their inherent limitations and tackle a broad range of reasoning tasks. Chameleon synthesizes programs to compose various tools, including LLMs, off-the-shel