Mixtral of Experts - Summary

Arxiv URL: https://arxiv.org/abs/2401.04088

Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed

Summary:

The paper introduces Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model that outperforms Llama 2 70B and GPT-3.5 on most standard benchmarks. At each layer, a router network selects two of eight experts per token, so the model draws on 47B total parameters while using only about 13B per token, which keeps inference fast and efficient. Mixtral is particularly strong in mathematics, code generation, and multilingual tasks. A fine-tuned version, Mixtral 8x7B – Instruct, surpasses comparable chat models on instruction-following benchmarks. Both models are released under the Apache 2.0 license, and the authors contributed changes to the vLLM project to enable open-source integration.
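
To make the routing mechanism concrete, here is a minimal sketch of a top-2 sparse MoE feed-forward layer in PyTorch. The class name, layer structure, and loop-based dispatch are illustrative simplifications, not Mistral's reference implementation; the default hyperparameters follow the configuration reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-2 mixture-of-experts feed-forward block (sketch only)."""

    def __init__(self, dim=4096, hidden_dim=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts, bias=False)  # router network
        # Each expert is a SwiGLU feed-forward network, as in Mixtral.
        self.experts = nn.ModuleList([
            nn.ModuleDict({
                "w1": nn.Linear(dim, hidden_dim, bias=False),
                "w2": nn.Linear(hidden_dim, dim, bias=False),
                "w3": nn.Linear(dim, hidden_dim, bias=False),
            })
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, dim)
        logits = self.gate(x)                                    # (tokens, experts)
        weights, selected = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                     # softmax over the two selected logits only
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = selected[:, slot] == e                    # tokens routed to expert e in this slot
                if mask.any():
                    xe = x[mask]
                    h = F.silu(expert["w1"](xe)) * expert["w3"](xe)   # SwiGLU
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert["w2"](h)
        return out

# Quick usage check: each token's output mixes exactly two expert FFNs.
layer = SparseMoELayer()
tokens = torch.randn(5, 4096)
print(layer(tokens).shape)  # torch.Size([5, 4096])
```

Because the softmax is taken over only the two selected gate logits, each token's output is a weighted combination of just two expert feed-forward networks, which is what keeps the per-token compute close to that of a 13B dense model.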

Key Insights & Learnings:

  • Mixtral 8x7B uses a sparse mixture of experts architecture in which a router selects two of eight experts per token at each layer, allowing for faster inference and higher throughput.
  • Despite having access to 47B parameters in total, Mixtral actively uses only about 13B parameters per token during inference, significantly improving efficiency (see the parameter-count sketch after this list).
  • Mixtral outperforms Llama 2 70B and GPT-3.5 in mathematics, code generation, and multilingual benchmarks, showing superior capabilities in these areas.
  • The fine-tuned version, Mixtral 8x7B – Instruct, trained with supervised fine-tuning and Direct Preference Optimization (DPO), exceeds other leading models on instruction-following benchmarks and shows reduced biases.
  • Mixtral and its fine-tuned variant are released under the Apache 2.0 license, promoting broad accessibility and potential for diverse applications.
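
As a rough check on the 47B vs. 13B figures, here is a back-of-the-envelope parameter count in Python. It assumes the architecture hyperparameters reported in the paper (model dimension 4096, 32 layers, expert hidden dimension 14336, 32 query and 8 key/value heads, 8 experts with top-2 routing, 32k vocabulary) and ignores small terms such as normalization weights.

```python
# Back-of-the-envelope parameter count for Mixtral 8x7B (approximate sketch).
dim, n_layers, hidden, n_experts, top_k = 4096, 32, 14336, 8, 2
head_dim, n_heads, n_kv_heads, vocab = 128, 32, 8, 32_000

attn = dim * (n_heads * head_dim)           # Wq
attn += 2 * dim * (n_kv_heads * head_dim)   # Wk, Wv (grouped-query attention)
attn += (n_heads * head_dim) * dim          # Wo

expert = 3 * dim * hidden                   # SwiGLU feed-forward: w1, w2, w3
router = dim * n_experts                    # gating layer

total = n_layers * (attn + router + n_experts * expert) + 2 * vocab * dim
active = n_layers * (attn + router + top_k * expert) + 2 * vocab * dim

print(f"total  ~ {total / 1e9:.1f}B parameters")   # ~46.7B
print(f"active ~ {active / 1e9:.1f}B per token")   # ~12.9B
```

Under these assumptions the totals come out to roughly 46.7B parameters overall and about 12.9B used per token, consistent with the 47B/13B numbers quoted above.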


Terms Mentioned: Sparse Mixture of Experts (SMoE), Router Network, Experts, Inference, Parameters, Multilingual Benchmarks, Fine-tuning, Direct Preference Optimization (DPO), Apache 2.0 license, vLLM project, Megablocks CUDA kernels, Skypilot, Expert Parallelism (EP), Transformer Architecture, SwiGLU, GShard, Bias Benchmarks

Technologies / Libraries Mentioned: Megablocks, Skypilot, TensorRT-LLM, Triton