Mixtures of Experts

4. Mixtures of Experts #

What’s a MoE? #

  • Replace one big feedforward network with many big feedforward networks + a selector (router) layer
    • After attention, each token has its own hidden representation. Instead of applying the same FFN to every token, we let the router decide which expert FFN to apply.
  • Key insight: you can increase the number of experts without increasing per-token FLOPs
  • Same FLOPs, more parameters ⇒ better performance [Fedus et al., 2022] (see the back-of-the-envelope sketch after this list)
  • Faster to train than comparable dense models [OLMoE]
  • Competitive with dense equivalents
  • Parallelizable across many devices
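
To make "same FLOPs, more parameters" concrete, here is a back-of-the-envelope sketch in Python; the dimensions are hypothetical and the router's (small) cost is ignored:

```python
# Hypothetical sizes: a dense FFN vs. a Switch-style top-1 MoE whose 8 experts
# each have the same shape as the dense FFN. Router cost is ignored.
d_model, d_ff, n_experts, top_k = 1024, 4096, 8, 1

ffn_params = 2 * d_model * d_ff            # up-projection + down-projection
dense_params = ffn_params
moe_params = n_experts * ffn_params        # every expert lives in memory

# ~2 FLOPs per weight (multiply + add); a token only runs through top_k experts.
dense_flops_per_token = 2 * ffn_params
moe_flops_per_token = 2 * top_k * ffn_params

print(f"params:      dense {dense_params/1e6:.0f}M vs MoE {moe_params/1e6:.0f}M")
print(f"FLOPs/token: dense {dense_flops_per_token/1e6:.0f}M vs MoE {moe_flops_per_token/1e6:.0f}M")
```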

What MoEs Generally Look Like #

  • Typical: replace MLP with MoE layer
  • Less common: MoE for attention heads
    • [ModuleFormer, JetMoE]
  • Key knobs (collected in the config sketch after this list):
    • Routing function
    • Expert sizes
    • Training objectives
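
For reference, these knobs can be collected in a small config object; a purely illustrative sketch (field names and defaults are assumptions, not from any particular codebase):

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    """Hypothetical summary of the main MoE design knobs."""
    n_experts: int = 64                # total experts per MoE layer
    top_k: int = 2                     # routing function: experts activated per token
    d_model: int = 1024                # hidden size entering the MoE layer
    d_expert: int = 2048               # expert size: per-expert FFN width
    n_shared_experts: int = 0          # always-on experts (see routing variations below)
    balance_loss_weight: float = 0.01  # training objective: load-balancing weight
```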

Routing Functions #

Almost all MoEs do token-choice top-k routing. Broadly, the options are (the first two are contrasted in the sketch after this list):

  • Token chooses expert
  • Expert chooses token
  • Or global routing via optimization [Fedus et al., 2022]
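
Token-choice and expert-choice differ only in which axis of the token-by-expert score matrix the top-k runs over; a minimal PyTorch sketch (shapes and the score matrix are illustrative):

```python
import torch

n_tokens, n_experts, k = 6, 4, 2
scores = torch.randn(n_tokens, n_experts).softmax(dim=-1)  # token-expert affinities

# Token-choice: each token picks its k best experts (row-wise top-k).
token_choice = scores.topk(k, dim=-1).indices   # shape [n_tokens, k]

# Expert-choice: each expert picks its k best tokens (column-wise top-k);
# load is balanced by construction, but some tokens may get no expert.
expert_choice = scores.topk(k, dim=0).indices   # shape [k, n_experts]

print(token_choice)
print(expert_choice)
```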

Common Variants #

  • Top-k
  • Hashing (simple, learning-free baseline; sketched after this list) [Fedus et al., 2022]
  • RL-based routing [Bengio et al., 2013] — rare today
  • Linear assignment [Clark et al., 2022] — rare today
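
The hashing baseline needs no learned router at all; a minimal sketch (the token ids and the modulo hash are illustrative assumptions):

```python
import torch

n_experts = 8
token_ids = torch.tensor([17, 42, 42, 1031, 5])  # hypothetical input ids
expert_ids = token_ids % n_experts               # any fixed hash of the token works
print(expert_ids)                                # same id -> same expert, always
```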

Top-k in Detail #

  • $u_t^l \in \mathbb{R}^{d}$: the token’s hidden state entering the MoE at layer $l$ (after attention & LN).
  • $N$: number of experts (FFN$_1$, …, FFN$_N$).
  • $e_i^l \in \mathbb{R}^{d}$: a learned router weight (one per expert) at layer $l$.
  • $K$: how many experts we activate per token (e.g., 1, 2, 4…).
  • $h_t^l \in \mathbb{R}^{d}$: the output of the MoE layer (with residual).
  1. Compute routing scores with a “logistic regressor”
$$ s_{i,t} \;=\; \mathrm{Softmax}_i\!\big((u_t^l)^{\!\top}e_i^l\big) $$
  2. Hard select top-K experts
$$ g_{i,t} \;=\; \begin{cases} s_{i,t}, & \text{if } s_{i,t} \in \mathrm{TopK}(\{s_{j,t}\}_{j=1}^N, K)\\ 0, & \text{otherwise} \end{cases} $$

3. Dispatch → experts → combine (with residual)

$$ h_t^l \;=\; \sum_{i=1}^{N} g_{i,t}\,\mathrm{FFN}_i(u_t^l)\;+\;u_t^l $$
  • Send $u_t^l$ only to the selected experts (others receive nothing).
  • Each selected expert $\mathrm{FFN}_i$ returns a vector in $\mathbb{R}^d$.
  • Weight those outputs by $g_{i,t}$, sum them, then add the residual $u_t^l$ (see the code sketch below).
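
A minimal PyTorch sketch of steps 1–3 (class and variable names are assumptions; it loops over experts for readability, whereas real systems batch the dispatch per expert):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Token-choice top-k MoE layer implementing the three steps above.
    Loops over experts for readability; real systems batch the dispatch."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # rows play the role of e_i^l
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:       # u: [n_tokens, d_model]
        s = F.softmax(self.router(u), dim=-1)                  # step 1: s_{i,t}
        _, topk_idx = s.topk(self.k, dim=-1)                   # step 2: top-K expert ids per token
        out = torch.zeros_like(u)
        for i, expert in enumerate(self.experts):
            mask = (topk_idx == i).any(dim=-1)                 # tokens that selected expert i
            if mask.any():
                gate = s[mask, i].unsqueeze(-1)                # g_{i,t} (kept unnormalized)
                out[mask] = out[mask] + gate * expert(u[mask])
        return out + u                                         # step 3: residual u_t^l

# Hypothetical usage: 10 tokens, 8 experts, top-2 routing.
moe = TopKMoELayer(d_model=32, d_ff=64, n_experts=8, k=2)
print(moe(torch.randn(10, 32)).shape)  # torch.Size([10, 32])
```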

Routing Variations in Practice #

  • Shared experts: Always-on experts applied to every token (DeepSeek, Qwen; idea from DeepSpeed MoE); see the modified combine step after this list
  • Fine-grained experts: Many smaller experts, with more of them activated per token
  • Device-aware routing: e.g., DeepSeek v2 “Top-M device routing”
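
With shared experts, the combine step above gains an always-on term. A sketch in the DeepSeekMoE-style notation, with $N_s$ shared and $N_r$ routed experts (the $(s)$/$(r)$ superscripts are labels introduced here):

$$ h_t^l \;=\; u_t^l \;+\; \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i\!\big(u_t^l\big) \;+\; \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}^{(r)}_i\!\big(u_t^l\big) $$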

Training MoEs #

Major challenge: sparse routing decisions are not differentiable.

Approaches #

  1. Reinforcement learning (REINFORCE)
    • Works, but high variance and complexity [Clark et al., 2020]
  2. Stochastic perturbations
    • Gaussian noise [Shazeer et al., 2017]
    • Multiplicative jitter [Fedus et al., 2022; later removed in Zoph et al., 2022]
  3. Heuristic balancing losses
    • Ensure experts are used evenly
    • Switch Transformer load-balancing loss [Fedus et al., 2022] (sketched after this list)
    • DeepSeek: per-expert and per-device balancing
    • DeepSeek v3: per-expert biases (“auxiliary-loss-free balancing”)
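
As a concrete instance of (3), the Switch Transformer loss multiplies, per expert, the fraction of tokens dispatched to it by the mean router probability it receives, then sums over experts; a minimal PyTorch sketch (function and tensor names are assumptions):

```python
import torch
import torch.nn.functional as F

def switch_balance_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Switch Transformer auxiliary loss: alpha * N * sum_i f_i * P_i, where
    f_i = fraction of tokens whose top-1 expert is i (not differentiable) and
    P_i = mean router probability for expert i (carries the gradient)."""
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)              # [n_tokens, n_experts]
    top1 = probs.argmax(dim=-1)                           # [n_tokens]
    f = F.one_hot(top1, n_experts).float().mean(dim=0)    # dispatch fraction per expert
    P = probs.mean(dim=0)                                 # mean router prob per expert
    return alpha * n_experts * torch.sum(f * P)

# Hypothetical usage: router logits for 16 tokens over 8 experts.
print(switch_balance_loss(torch.randn(16, 8)))
```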

TBD