4. Mixtures of Experts #
What’s a MoE? #
- Replace one big feedforward network with many big feedforward networks + a selector (router) layer
- After attention, each token has its own hidden representation. Instead of applying the same FFN to every token, we let the router decide which expert FFN to apply.
- Key insight: you can increase the number of experts (and hence parameters) without increasing per-token FLOPs, since each token only passes through a few experts (see the back-of-the-envelope sketch below)
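To make the key insight concrete, here is a back-of-the-envelope comparison with made-up sizes (a standard 2-layer FFN, Switch-style top-1 routing; the numbers are illustrative, not from any particular model):

```python
# Dense FFN vs. top-1 MoE layer: parameter count and per-token FLOPs.
# All sizes are illustrative, not taken from any real model.
d_model, d_ff = 4096, 16384            # hidden size, FFN inner size
n_experts, top_k = 64, 1               # Switch-style: many experts, each token routed to 1

ffn_params = 2 * d_model * d_ff        # W_in (d_model x d_ff) + W_out (d_ff x d_model), biases ignored
ffn_flops_per_token = 2 * ffn_params   # ~2 FLOPs per multiply-accumulate

moe_params = n_experts * ffn_params                 # 64x the parameters...
moe_flops_per_token = top_k * ffn_flops_per_token   # ...but the same per-token FLOPs

print(f"dense: {ffn_params/1e6:.0f}M params, {ffn_flops_per_token/1e9:.2f} GFLOPs/token")
print(f"MoE:   {moe_params/1e9:.1f}B params, {moe_flops_per_token/1e9:.2f} GFLOPs/token")
```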
Why are MoEs popular? #
- Same FLOPs, more parameters ⇒ better performance [Fedus et al., 2022]
- Faster to train to a given quality [OLMoE]
- Competitive with equivalent dense models
- Experts parallelize naturally across many devices (expert parallelism)
What MoEs Generally Look Like #
- Typical: replace MLP with MoE layer
- Less common: MoE for attention heads [ModuleFormer, JetMoE]
- Key knobs:
- Routing function
- Expert sizes
- Training objectives
Routing Functions #
Almost all MoEs do token-choice top-k routing, but there are three broad families of routing function:
- Token chooses expert (token-choice)
- Expert chooses token (expert-choice)
- Global routing via joint optimization [Fedus et al., 2022]
The first two are contrasted in the sketch below.
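A minimal PyTorch sketch of the first two selection rules (shapes are made up; everything downstream of selection, including capacity handling, is omitted):

```python
import torch

T, N, k = 8, 4, 2                  # tokens, experts, experts chosen per token
scores = torch.randn(T, N)         # router scores: one per (token, expert) pair

# Token-choice top-k: each token picks its k highest-scoring experts.
token_choice = torch.topk(scores, k, dim=-1).indices    # shape (T, k)

# Expert-choice: each expert picks the c tokens it scores highest,
# so some tokens may end up with zero experts and others with several.
c = 4                              # per-expert capacity
expert_choice = torch.topk(scores, c, dim=0).indices    # shape (c, N)
```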
Common Variants #
- Top-k
- Hashing: a fixed, non-learned baseline (sketched below) [Fedus et al., 2022]
- RL-based routing (Bengio 2013) — rare today
- Linear assignment (Clark ’22) — rare today
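Hash routing needs no learned router at all; a minimal sketch (assuming routing on token IDs, modulo the number of experts):

```python
import torch

n_experts = 8

def hash_route(token_ids: torch.Tensor) -> torch.Tensor:
    # Fixed assignment: the expert is a deterministic function of the token id,
    # so there is nothing to learn (and nothing that can collapse during training).
    return token_ids % n_experts

expert_idx = hash_route(torch.tensor([17, 42, 42, 1033]))   # tensor([1, 2, 2, 1])
```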
Top-k in Detail #
- $u_t^l \in \mathbb{R}^{d}$: the token’s hidden state entering the MoE at layer $l$ (after attention & LN).
- $N$: number of experts (FFN$_1$, …, FFN$_N$).
- $e_i^l \in \mathbb{R}^{d}$: a learned router weight (one per expert) at layer $l$.
- $K$: how many experts we activate per token (e.g., 1, 2, 4…).
- $h_t^l \in \mathbb{R}^{d}$: the output of the MoE layer (with residual).
1. Compute routing scores with a softmax “logistic regressor”
$$ s_{i,t} \;=\; \mathrm{Softmax}_i\!\left((u_t^l)^\top e_i^l\right) $$
2. Hard-select the top-$K$ experts: $g_{i,t} = s_{i,t}$ if expert $i$ is among token $t$’s top-$K$ scores, and $g_{i,t} = 0$ otherwise
3. Dispatch → experts → combine (with residual)
$$ h_t^l \;=\; \sum_{i=1}^{N} g_{i,t}\,\mathrm{FFN}_i(u_t^l)\;+\;u_t^l $$
- Send $u_t^l$ only to the selected experts (others receive nothing).
- Each selected expert $\mathrm{FFN}_i$ returns a vector in $\mathbb{R}^d$.
- Weight those outputs by $g_{i,t}$, sum them, then add the residual $u_t^l$.
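Putting the three steps together, a minimal PyTorch sketch of the whole layer under the notation above (the class name `TopKMoELayer` is made up; it uses softmax-then-top-K gating as in the equations, ignores expert capacity, dropped tokens, and parallelism, and loops over experts for clarity rather than speed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Token-choice top-K MoE layer following the notation above (no capacity limits)."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # rows of the weight are the e_i^l
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:          # u: (tokens, d_model), the u_t^l
        s = F.softmax(self.router(u), dim=-1)                    # s_{i,t}: routing scores
        g, idx = torch.topk(s, self.k, dim=-1)                   # gates g_{i,t} and chosen expert ids
        h = u.clone()                                            # start from the residual u_t^l
        for i, expert in enumerate(self.experts):
            for slot in range(self.k):
                sel = idx[:, slot] == i                          # tokens whose slot-th choice is expert i
                if sel.any():
                    h[sel] += g[sel, slot, None] * expert(u[sel])
        return h

moe = TopKMoELayer(d_model=64, d_ff=256, n_experts=8, k=2)
out = moe(torch.randn(10, 64))                                   # (10, 64)
```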
Routing Variations in Practice #
- Shared experts: Always-on experts applied to every token (DeepSeek, Qwen; idea from DeepSpeed-MoE); see the combine step below
- Fine-grained experts: Many smaller experts
- Device-aware routing: e.g., DeepSeek v2 “Top-M device routing”
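With shared experts, only the combine step changes: $N_s$ always-on experts are applied to every token with no gate, alongside the $N_r$ routed ones. A sketch of that step, following the DeepSeekMoE-style formulation and extending the notation above:

$$ h_t^l \;=\; u_t^l \;+\; \sum_{j=1}^{N_s} \mathrm{FFN}^{(s)}_j(u_t^l) \;+\; \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}^{(r)}_i(u_t^l) $$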
Training MoEs #
Major challenge: sparse routing decisions are not differentiable.
Approaches #
- Reinforcement learning (REINFORCE)
- Works, but high variance and complexity [Clark et al., 2022]
- Stochastic perturbations
- Gaussian noise [Shazeer et al., 2017]
- Multiplicative jitter [Fedus et al., 2022; later removed in Zoph et al., 2022]
- Heuristic balancing losses
- Ensure experts are used evenly
- Switch Transformer load-balancing loss [Fedus et al., 2022] (sketched after this list)
- DeepSeek: per-expert and per-device balancing
- DeepSeek v3: per-expert biases (“auxiliary-loss-free balancing”)
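The Switch Transformer loss is the most common of these heuristics: with $f_i$ the fraction of tokens routed to expert $i$ and $P_i$ the mean router probability for expert $i$, it penalizes $N \sum_i f_i P_i$. A sketch (top-1 routing for simplicity; the coefficient name `alpha` and its value follow the usual convention, nothing fixed here):

```python
import torch
import torch.nn.functional as F

def switch_load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_logits: (tokens, n_experts). Equals alpha when routing is perfectly
    uniform; concentrating tokens on a few experts drives it up (toward alpha * N).
    """
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)              # (tokens, n_experts)
    top1 = probs.argmax(dim=-1)                           # top-1 routing decision per token
    f = F.one_hot(top1, n_experts).float().mean(dim=0)    # f_i: fraction of tokens sent to expert i
    P = probs.mean(dim=0)                                 # P_i: mean router probability for expert i
    return alpha * n_experts * torch.sum(f * P)

aux_loss = switch_load_balancing_loss(torch.randn(128, 8))   # added to the LM loss
```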
TBD