Mixtures of Experts

4. Mixtures of Experts #

What’s a MoE? #

  • Replace one big feedforward network with many big feedforward networks + a selector (router) layer
    • After attention, each token has its own hidden representation. Instead of applying the same FFN to every token, we let the router decide which expert FFN to apply.
  • Key insight: you can increase the number of experts without increasing per-token FLOPs
  • Same FLOPs, more parameters ⇒ better performance [Fedus et al., 2022] (see the back-of-the-envelope sketch after this list)
  • Faster to train than comparable dense models [OLMoE]
  • Competitive with dense equivalents
  • Parallelizable across many devices
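
To make "same FLOPs, more parameters" concrete, here is a back-of-the-envelope sketch in Python; the dimensions are hypothetical and the router's (small) cost is ignored:

```python
# Hypothetical sizes: a dense FFN vs. a Switch-style top-1 MoE whose 8 experts
# each have the same shape as the dense FFN. Router cost is ignored.
d_model, d_ff, n_experts, top_k = 1024, 4096, 8, 1

ffn_params = 2 * d_model * d_ff            # up-projection + down-projection
dense_params = ffn_params
moe_params = n_experts * ffn_params        # every expert lives in memory

# ~2 FLOPs per weight (multiply + add); a token only runs through top_k experts.
dense_flops_per_token = 2 * ffn_params
moe_flops_per_token = 2 * top_k * ffn_params

print(f"params:      dense {dense_params/1e6:.0f}M vs MoE {moe_params/1e6:.0f}M")
print(f"FLOPs/token: dense {dense_flops_per_token/1e6:.0f}M vs MoE {moe_flops_per_token/1e6:.0f}M")
```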

What MoEs Generally Look Like #

  • Typical: replace MLP with MoE layer
  • Less common: MoE for attention heads
    • [ModuleFormer, JetMoE]
  • Key knobs (collected in the config sketch after this list):
    • Routing function
    • Expert sizes
    • Training objectives
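
For reference, these knobs can be collected in a small config object; a purely illustrative sketch (field names and defaults are assumptions, not from any particular codebase):

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    """Hypothetical summary of the main MoE design knobs."""
    n_experts: int = 64                # total experts per MoE layer
    top_k: int = 2                     # routing function: experts activated per token
    d_model: int = 1024                # hidden size entering the MoE layer
    d_expert: int = 2048               # expert size: per-expert FFN width
    n_shared_experts: int = 0          # always-on experts (see routing variations below)
    balance_loss_weight: float = 0.01  # training objective: load-balancing weight
```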

Routing Functions #

Almost all MoEs do token-choice top-k routing. Broadly, the options are (the first two are contrasted in the sketch after this list):

  • Token chooses expert
  • Expert chooses token
  • Or global routing via optimization [Fedus et al., 2022]
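
Token-choice and expert-choice differ only in which axis of the token-by-expert score matrix the top-k runs over; a minimal PyTorch sketch (shapes and the score matrix are illustrative):

```python
import torch

n_tokens, n_experts, k = 6, 4, 2
scores = torch.randn(n_tokens, n_experts).softmax(dim=-1)  # token-expert affinities

# Token-choice: each token picks its k best experts (row-wise top-k).
token_choice = scores.topk(k, dim=-1).indices   # shape [n_tokens, k]

# Expert-choice: each expert picks its k best tokens (column-wise top-k);
# load is balanced by construction, but some tokens may get no expert.
expert_choice = scores.topk(k, dim=0).indices   # shape [k, n_experts]

print(token_choice)
print(expert_choice)
```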

Common Variants #

  • Top-k
  • Hashing (simple, learning-free baseline; sketched after this list) [Fedus et al., 2022]
  • RL-based routing [Bengio et al., 2013] — rare today
  • Linear assignment [Clark et al., 2022] — rare today
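
The hashing baseline needs no learned router at all; a minimal sketch (the token ids and the modulo hash are illustrative assumptions):

```python
import torch

n_experts = 8
token_ids = torch.tensor([17, 42, 42, 1031, 5])  # hypothetical input ids
expert_ids = token_ids % n_experts               # any fixed hash of the token works
print(expert_ids)                                # same id -> same expert, always
```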

Top-k in Detail #

  • $u_t^l \in \mathbb{R}^{d}$: the token’s hidden state entering the MoE at layer $l$ (after attention & LN).
  • $N$: number of experts (FFN$_1$, …, FFN$_N$).
  • $e_i^l \in \mathbb{R}^{d}$: a learned router weight (one per expert) at layer $l$.
  • $K$: how many experts we activate per token (e.g., 1, 2, 4…).
  • $h_t^l \in \mathbb{R}^{d}$: the output of the MoE layer (with residual).
  1. Compute routing scores with a “logistic regressor”
$$ s_{i,t} \;=\; \mathrm{Softmax}_i\!\big((u_t^l)^{\!\top}e_i^l\big) $$
  2. Hard select top-K experts
$$ g_{i,t} \;=\; \begin{cases} s_{i,t}, & \text{if } s_{i,t} \in \mathrm{TopK}(\{s_{j,t}\}_{j=1}^N, K)\\ 0, & \text{otherwise} \end{cases} $$

3. Dispatch → experts → combine (with residual)

$$ h_t^l \;=\; \sum_{i=1}^{N} g_{i,t}\,\mathrm{FFN}_i(u_t^l)\;+\;u_t^l $$
  • Send $u_t^l$ only to the selected experts (others receive nothing).
  • Each selected expert $\mathrm{FFN}_i$ returns a vector in $\mathbb{R}^d$.
  • Weight those outputs by $g_{i,t}$, sum them, then add the residual $u_t^l$ (see the code sketch below).
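
A minimal PyTorch sketch of steps 1–3 (class and variable names are assumptions; it loops over experts for readability, whereas real systems batch the dispatch per expert):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Token-choice top-k MoE layer implementing the three steps above.
    Loops over experts for readability; real systems batch the dispatch."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # rows play the role of e_i^l
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:       # u: [n_tokens, d_model]
        s = F.softmax(self.router(u), dim=-1)                  # step 1: s_{i,t}
        _, topk_idx = s.topk(self.k, dim=-1)                   # step 2: top-K expert ids per token
        out = torch.zeros_like(u)
        for i, expert in enumerate(self.experts):
            mask = (topk_idx == i).any(dim=-1)                 # tokens that selected expert i
            if mask.any():
                gate = s[mask, i].unsqueeze(-1)                # g_{i,t} (kept unnormalized)
                out[mask] = out[mask] + gate * expert(u[mask])
        return out + u                                         # step 3: residual u_t^l

# Hypothetical usage: 10 tokens, 8 experts, top-2 routing.
moe = TopKMoELayer(d_model=32, d_ff=64, n_experts=8, k=2)
print(moe(torch.randn(10, 32)).shape)  # torch.Size([10, 32])
```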

Routing Variations in Practice #

  • Shared experts: Always-on experts applied to every token (DeepSeek, Qwen; idea from DeepSpeed MoE); see the modified combine step after this list
  • Fine-grained experts: Many smaller experts, with more of them activated per token
  • Device-aware routing: e.g., DeepSeek v2 “Top-M device routing”
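
With shared experts, the combine step above gains an always-on term. A sketch in the DeepSeekMoE-style notation, with $N_s$ shared and $N_r$ routed experts (the $(s)$/$(r)$ superscripts are labels introduced here):

$$ h_t^l \;=\; u_t^l \;+\; \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i\!\big(u_t^l\big) \;+\; \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}^{(r)}_i\!\big(u_t^l\big) $$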

Training MoEs #

Major challenge: sparse routing decisions are not differentiable.

Approaches #

  1. Reinforcement learning (REINFORCE)
    • Works, but high variance and complexity [Clark et al., 2020]
  2. Stochastic perturbations
    • Gaussian noise [Shazeer et al., 2017]
    • Multiplicative jitter [Fedus et al., 2022; later removed in Zoph et al., 2022]
  3. Heuristic balancing losses
    • Ensure experts are used evenly
    • Switch Transformer load-balancing loss [Fedus et al., 2022] (sketched after this list)
    • DeepSeek: per-expert and per-device balancing
    • DeepSeek v3: per-expert biases (“auxiliary-loss-free balancing”)
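
As a concrete instance of (3), the Switch Transformer loss multiplies, per expert, the fraction of tokens dispatched to it by the mean router probability it receives, then sums over experts; a minimal PyTorch sketch (function and tensor names are assumptions):

```python
import torch
import torch.nn.functional as F

def switch_balance_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Switch Transformer auxiliary loss: alpha * N * sum_i f_i * P_i, where
    f_i = fraction of tokens whose top-1 expert is i (not differentiable) and
    P_i = mean router probability for expert i (carries the gradient)."""
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)              # [n_tokens, n_experts]
    top1 = probs.argmax(dim=-1)                           # [n_tokens]
    f = F.one_hot(top1, n_experts).float().mean(dim=0)    # dispatch fraction per expert
    P = probs.mean(dim=0)                                 # mean router prob per expert
    return alpha * n_experts * torch.sum(f * P)

# Hypothetical usage: router logits for 16 tokens over 8 experts.
print(switch_balance_loss(torch.randn(16, 8)))
```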

TBD