> Wolfe, Cameron R. 'Mixture-of-Experts (MoE) LLMs'. _Deep (Learning) Focus_, 27 Jan. 2025, [https://cameronrwolfe.substack.com/p/moe-llms](https://cameronrwolfe.substack.com/p/moe-llms).

# Mixture-of-Experts (MoE) LLMs

- For the basic principles of MoE LLMs, see [[@wolfeMixtureExpertsMoEBirth2024|Mixture-of-Experts (MoE): The Birth and Rise of Conditional Computation]].
- Here, we discuss technical details of MoE implementations, such as the auxiliary loss functions used in MoE models (illustrative code sketches for these losses and for a full MoE layer are included below).

## Importance Loss

- To encourage a balanced selection of experts during training, we can add a constraint to the training loss that rewards the model for uniformly leveraging each of its experts.
- This can be done by defining an "importance" score for each expert, based upon the probability predicted for each expert by the routing mechanism.
- Given a batch of data, we compute importance by summing the probabilities assigned to each expert across all tokens in the batch.
- Then, to determine whether these probabilities are balanced, we take the squared [coefficient of variation (CV)](https://en.wikipedia.org/wiki/Coefficient_of_variation) of the expert importance scores.
- Put simply, **the CV will be a small value if all experts have similar importance scores and vice versa**.

![Details of computing the importance loss|600](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74fd12a9-7327-47e3-b7b7-400e801bf5c8_2640x1354.png)

## Load Balancing

- Although the importance loss described above is useful, assigning experts equal importance does not guarantee that tokens are routed uniformly.
- For example, experts would have equal importance with:
    - A few tokens that assign them very high probability;
    - A much larger number of tokens that assign lower probability.
- To solve this problem, we can create a single auxiliary loss term that captures both expert importance and load balancing, defined as the equal routing of tokens between each of the experts. This loss usually considers two quantities:
    1. The fraction of router probability allocated to each expert;
    2. The fraction of tokens dispatched to each expert.

![The expert load balancing loss|600](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F644cec45-8dff-491e-9d41-e53ee4b0c7df_1574x764.png)

## Router $z$-loss

- The auxiliary load balancing loss described above is widely used throughout the MoE literature, but some authors propose an extra auxiliary loss term, called the **router $z$-loss**, that can further improve training stability.
- The router $z$-loss constrains the size of the logits predicted by the routing mechanism (**not the probabilities, so this constraint is applied before the softmax**).

![The router $z$-loss|600](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1790688e-5328-45f2-98c0-717ba6041470_2090x636.png)
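- To make these loss terms concrete, here is a minimal PyTorch-style sketch that computes all three auxiliary losses from a batch of router logits. The function name (`moe_auxiliary_losses`), the top-$k$ dispatch mask, and the exact scaling choices are illustrative assumptions, not the precise formulations from any particular paper.

```python
import torch
import torch.nn.functional as F


def moe_auxiliary_losses(router_logits: torch.Tensor, top_k: int = 1):
    """Sketch of three common MoE auxiliary losses for one routing layer.

    router_logits: (num_tokens, num_experts) raw router outputs (pre-softmax).
    top_k:         number of experts each token is dispatched to.
    """
    num_tokens, num_experts = router_logits.shape
    router_probs = F.softmax(router_logits, dim=-1)              # (tokens, experts)

    # Importance loss: squared coefficient of variation (CV^2) of per-expert
    # importance, where importance = sum of the router probabilities assigned
    # to that expert across all tokens in the batch.
    importance = router_probs.sum(dim=0)                         # (experts,)
    importance_loss = importance.var() / (importance.mean() ** 2 + 1e-10)

    # Load-balancing loss: combines (1) the fraction of router probability
    # allocated to each expert and (2) the fraction of tokens dispatched to
    # each expert, so both quantities must be uniform to minimize the loss.
    dispatched = F.one_hot(router_probs.topk(top_k, dim=-1).indices,
                           num_experts).sum(dim=1).float()       # (tokens, experts)
    prob_fraction = router_probs.mean(dim=0)                     # (experts,)
    token_fraction = dispatched.mean(dim=0)                      # (experts,)
    load_balancing_loss = num_experts * (prob_fraction * token_fraction).sum()

    # Router z-loss: penalizes large router logits (before softmax) by squaring
    # the log-sum-exp of each token's logits and averaging over the batch.
    z_loss = torch.logsumexp(router_logits, dim=-1).pow(2).mean()

    return importance_loss, load_balancing_loss, z_loss
```

- In practice, each of these terms is scaled by a small coefficient and added to the model's main language modeling loss during training.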
## Expert Capacity

- The computation performed in an MoE layer is dynamic due to the routing decisions made during both training and inference.
- However, most practical implementations of sparse models **==usually have static batch sizes==** for each expert; **this is a useful trick for improving hardware utilization**.
- To formalize the fixed batch size that we set for each expert, we can define the **expert capacity**, where $N$ is the number of experts:

$$
\textrm{Expert Capacity} \coloneqq \frac{\textrm{Total Batch Tokens}}{N} \cdot \textrm{Capacity Factor}
$$

- The expert capacity defines the maximum number of tokens in a batch that can be sent to each expert.
- For example, with 4,096 tokens in a batch, $N = 8$ experts, and a capacity factor of 1.25, each expert can process at most $4096 / 8 \cdot 1.25 = 640$ tokens (see the combined code sketch below).
- If the number of tokens routed to an expert exceeds the expert capacity, then we simply "drop" the extra tokens.
- More specifically, we perform no computation for these tokens and let their representation flow directly to the next layer via the transformer's residual connection.

![|600](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417c5fc8-2524-48e1-a9ef-460b4476d323_1784x1184.png)

- Interestingly, **MoE models tend to perform well with relatively low capacity factors (between 1 and 2)**.

## Computing the Output

The output of an MoE layer is computed in three steps (a combined code sketch appears after the Shared Experts section below):

1. Send the tokens to their selected experts.
2. Compute the output of the experts for these tokens.
3. Take a weighted average of expert outputs, where the weights are simply the probabilities assigned to each expert by the router.

![Computing the output of an MoE layer|500](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf751012-739f-4bac-92be-4e22cce61e9d_1368x648.png)

## Shared Experts

- We have two groups of experts: **shared experts and routed experts**.
- All tokens are always passed through the shared experts.
- Tokens are passed through the routed experts according to a normal MoE routing mechanism.
- Usually, the number of shared experts is kept lower than the number of routed experts, since **increasing the number of shared experts degrades the sparsity benefits of the MoE**.
- ==**The motivation behind using shared experts is minimizing the amount of redundant information between experts.**==
- By having a set of shared experts, **we can allow the network to store shared information within these experts**, rather than having to replicate the same information across several different experts.

![Shared vs. routed experts|500](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ec49e67-8f67-4eea-8759-c27231ffacf5_1212x628.png)

![Computing output of an MoE layer with shared experts|500](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aceff4e-ff4c-4729-a628-a6ac28bfa600_1198x406.png)
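- The sketch below ties together the last three sections: it computes the expert capacity, drops overflow tokens (which pass to the next layer via the residual path), weights routed-expert outputs by their router probabilities, and always applies the shared experts. The class name (`SketchMoELayer`), the expert architecture, and the way the residual connection is folded into the layer are illustrative assumptions, not a reference implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchMoELayer(nn.Module):
    """Illustrative MoE layer: top-k routing, expert capacity, shared experts."""

    def __init__(self, d_model, d_ff, num_routed, num_shared,
                 top_k=2, capacity_factor=1.25):
        super().__init__()

        def make_expert():  # a simple feed-forward expert
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

        self.top_k = top_k
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(d_model, num_routed, bias=False)
        self.routed_experts = nn.ModuleList([make_expert() for _ in range(num_routed)])
        self.shared_experts = nn.ModuleList([make_expert() for _ in range(num_shared)])

    def forward(self, x):                                     # x: (num_tokens, d_model)
        num_tokens, _ = x.shape
        num_routed = len(self.routed_experts)

        # Expert capacity = (total batch tokens / N) * capacity factor.
        capacity = math.ceil(num_tokens / num_routed * self.capacity_factor)

        probs = F.softmax(self.router(x), dim=-1)             # (tokens, num_routed)
        top_probs, top_idx = probs.topk(self.top_k, dim=-1)   # (tokens, k)

        # Residual path: tokens that receive no routed-expert computation still
        # flow to the next layer, carrying only the shared-expert updates below.
        out = x.clone()

        # Shared experts: every token is always passed through these.
        for expert in self.shared_experts:
            out = out + expert(x)

        # Routed experts: each processes at most `capacity` tokens; overflow
        # tokens are dropped (this sketch keeps the first `capacity` in batch order).
        for e, expert in enumerate(self.routed_experts):
            token_ids, slots = (top_idx == e).nonzero(as_tuple=True)
            token_ids, slots = token_ids[:capacity], slots[:capacity]
            if token_ids.numel() == 0:
                continue
            gate = top_probs[token_ids, slots].unsqueeze(-1)  # router probability
            out = out.index_add(0, token_ids, gate * expert(x[token_ids]))

        return out
```

- For example, `SketchMoELayer(d_model=512, d_ff=2048, num_routed=8, num_shared=1)(torch.randn(4096, 512))` routes 4,096 tokens with a per-expert capacity of $4096 / 8 \cdot 1.25 = 640$, matching the arithmetic above.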
## MoE Language Models

- DeepSeek-AI, et al. _DeepSeek-V3 Technical Report_. arXiv:2412.19437, arXiv, 18 Feb. 2025. _arXiv.org_, [https://doi.org/10.48550/arXiv.2412.19437](https://doi.org/10.48550/arXiv.2412.19437).
- DeepSeek-AI, et al. _DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model_. arXiv:2405.04434, arXiv, 19 June 2024. _arXiv.org_, [https://doi.org/10.48550/arXiv.2405.04434](https://doi.org/10.48550/arXiv.2405.04434).
- 'Introducing DBRX: A New State-of-the-Art Open LLM'. _Databricks_, 27 Mar. 2024, [https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm).
- Xue, Fuzhao, et al. _OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models_. arXiv:2402.01739, arXiv, 27 Mar. 2024. _arXiv.org_, [https://doi.org/10.48550/arXiv.2402.01739](https://doi.org/10.48550/arXiv.2402.01739).
- _Open Release of Grok-1 | xAI_. 17 Mar. 2024, [https://x.ai/news/grok-os](https://x.ai/news/grok-os).
- Jiang, Albert Q., et al. _Mixtral of Experts_. arXiv:2401.04088, arXiv, 8 Jan. 2024. _arXiv.org_, [https://doi.org/10.48550/arXiv.2401.04088](https://doi.org/10.48550/arXiv.2401.04088).