Parallelism Basics #
1. Basics #
1. Multi-GPU, Multi-Machine Parallelism #
The figure below shows a simplified overview of a training node with 8 GPUs, 2 CPUs, and connections to the InfiniBand network for multi-node communication.
For example, GPT-NeoX-20B was trained on a cluster of 12 servers, each equipped with 8 NVIDIA A100 GPUs and 2 CPUs.
1. Components
- CPUs (CPU₀, CPU₁)
- Two sockets per node (e.g., AMD EPYC).
- Provide PCIe lanes for GPUs and network adapters.
- Interconnected via xGMI-2 (x16): ~16 GT/s per lane → allows CPU₀ and CPU₁ to share data.
- PLX (PCIe Switches)
- PCI Express 4.0 switches that expand the number of lanes from each CPU.
- Each PLX connects CPU ↔ GPUs or CPU ↔ HCA.
- Bandwidth: 16 GT/s per lane × 16 lanes = ~32 GB/s per direction.
- GPUs (GPU₀ … GPU₇)
- Eight NVIDIA GPUs per node (e.g., A100).
- Each GPU connects upward to a PLX (PCIe) and downward to the NVSwitch fabric.
- GPUs are the primary compute devices for training.
- NVSwitch (NVSwitch₀ … NVSwitch₅)
- Dedicated crossbar switches for GPU-to-GPU communication.
- Each GPU connects to NVSwitch via NVLink 3.0 (2x links).
- Bandwidth per NVLink 3.0 link: ~50 GB/s; with 12 links per A100 this gives ~600 GB/s of aggregate NVLink bandwidth → very fast intra-node GPU communication.
- Ensures all 8 GPUs form a fully connected high-bandwidth network inside the node.
- HCA (Host Channel Adapter, HCA₀ … HCA₃)
- Network adapters (e.g., ConnectX-6) that connect the node to the InfiniBand fabric.
- Each HCA attaches to a CPU via PCIe 16x.
- Bandwidth per link: HDR InfiniBand ~50 Gb/s per lane × 4 lanes = 200 Gb/s (~25 GB/s).
- Supports GPUDirect RDMA → GPUs can directly send/receive data to remote GPUs without CPU involvement.
- InfiniBand Switches (External, not inside the node)
- Purple “Switch₀ / Switch₁” labels indicate connections to external InfiniBand switches.
- These external devices interconnect all nodes in the cluster into a global high-speed fabric.
- Typically deployed in redundant pairs (two switches) for load-balancing and reliability.
Hierarchy of Communication Speeds
- Fastest (inside node): GPU ↔ GPU via NVSwitch/NVLink (hundreds of GB/s, ~600 GB/s aggregate on A100).
- Slower (across nodes): GPU ↔ remote GPU via InfiniBand (~25 GB/s per HDR link).
Implications for Training
- Intra-node parallelism (tensor parallelism): relies on NVLink/NVSwitch for fast GPU-to-GPU synchronization.
- Inter-node parallelism (pipeline/data parallelism): relies on InfiniBand HCAs and switches, which are slower, so communication must be minimized.
- Design principle: keep heavy communication (activations, tensor splits) inside the node, and use lighter communication (gradients, parameters) across nodes.
2. Basics of Collective Communication #
Broadcast
- A single “root” rank provides input, and the value is copied to all other ranks.
- Cost: one device sends data of size N to all other P−1 devices; naive cost ~O(P·N), tree-based ~O(N·log P).
Reduce
- Each rank contributes its input; a reduction (e.g., sum) is computed; only the root rank receives the final result.
- Cost: all P devices each contribute data of size N, aggregated to one device; naive cost ~O(P·N), tree-based ~O(N·log P).
All Gather
- Each rank contributes a unique slice of data. At the end, every rank receives the concatenation of all slices.
- Use Case: Parameters or activations are sharded across GPUs, then reconstructed.
- Cost: each device holds a distinct slice of size N, and all P·N data must reach every device. Per-device cost: ~O((P−1)·N).
Reduce Scatter
- First, all ranks’ inputs are reduced (like All Reduce). Then, instead of every rank getting the full result, the reduced output is split into chunks and distributed.
- Each rank gets only its corresponding partition of the reduced result.
- Cost: all P devices start with a full vector of size N; after reduction each keeps only its 1/P partition. Per-device cost: ~((P−1)/P)·N ≈ N.
All Reduce
- Each rank starts with its own input (e.g., gradients). The system computes a reduction (e.g., sum) across all inputs and distributes the result back to all ranks.
- Cost: All Reduce = Reduce Scatter + All Gather, so the per-device cost is ~2·((P−1)/P)·N ≈ 2N (see the sketch below).
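To make these primitives concrete, here is a minimal sketch using `torch.distributed`. It assumes the script is launched with `torchrun` under the NCCL backend with one GPU per rank; the function name `demo_collectives` is illustrative, not part of any API.

```python
import torch
import torch.distributed as dist

def demo_collectives():
    # Assumes launch via `torchrun --nproc_per_node=<P> script.py` with NCCL.
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    device = torch.device("cuda")

    # Broadcast: rank 0's tensor is copied to every other rank.
    x = torch.full((4,), float(rank), device=device)
    dist.broadcast(x, src=0)                     # afterwards x == 0 on all ranks

    # All Reduce: every rank ends up with the sum over all ranks.
    g = torch.ones(4, device=device) * rank
    dist.all_reduce(g, op=dist.ReduceOp.SUM)     # g == 0 + 1 + ... + (world-1)

    # All Gather: concatenate each rank's unique slice on every rank.
    local_slice = torch.full((2,), float(rank), device=device)
    gathered = [torch.empty_like(local_slice) for _ in range(world)]
    dist.all_gather(gathered, local_slice)

    # Reduce Scatter: reduce across ranks, then each rank keeps only one chunk.
    inputs = [torch.ones(2, device=device) * rank for _ in range(world)]
    out = torch.empty(2, device=device)
    dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    demo_collectives()
```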
3. TPUs vs GPUs #
TPU Networking – Toroidal Mesh
- 2D toroidal mesh (like a grid where edges wrap around).
- Each chip is directly connected only to its neighbors (left/right, up/down, plus wrap-around).
- To communicate with a faraway chip, data must hop through multiple intermediate chips.
- Advantage: scales well to very large numbers of chips, since each chip only needs a few links.
- Well suited to collective communication such as all-reduce.
- Disadvantage: higher latency for communication between far-apart chips (multi-hop).
GPU Networking – All-to-All up to 256
- NVIDIA GPU clusters connect GPUs using switches (NVSwitch or InfiniBand).
- Topologies are designed for all-to-all connectivity: each GPU can (logically) communicate with any other GPU, usually in one hop.
- Up to 256 GPUs can be interconnected this way.
GPU SuperPODs: A100 vs H100
A100 SuperPOD (blue, InfiniBand): Each DGX node (8 GPUs) is connected internally with NVSwitch. For inter-node communication, GPUs rely on InfiniBand switches arranged in a spine–leaf architecture. At cluster scale (32 nodes, 256 GPUs), the bisection bandwidth is about 6,400 GB/s, which becomes a limiting factor for large-scale training.
H100 SuperPOD (green, NVLink Switch): Each DGX node is again internally connected with NVSwitch, but across nodes the GPUs now use dedicated NVLink Switches (NVS) instead of InfiniBand. This provides a massive jump in cluster-wide bandwidth: 57,600 GB/s at 256 GPUs. Cross-node communication is much closer to intra-node NVLink speeds, resulting in far better scaling efficiency.
Summary
- With A100, once you scale to 256 GPUs, InfiniBand bandwidth becomes the bottleneck.
- With H100, the new NVLink Switch fabric keeps cross-node communication much faster, so scaling efficiency remains high.
- TPU mesh is different: it scales to thousands of chips but each communication may take multiple hops. GPUs instead aim for high-bandwidth all-to-all within a bounded scale (like 256).
2. Data Parallelism, ZeRO #
1. Naïve Data Parallelism #
We begin with the standard Adam optimizer update rule:
$g_t = \nabla_\theta f(\theta_t)$
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
$\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
```python
import torch

def train_accumulate(params: ModelParameters, num_epochs, learning_rate, batch_size,
                     beta1, beta2, eps, weight_decay):
    # Initialize moment estimates
    m_w = torch.zeros_like(params.w)
    v_w = torch.zeros_like(params.w)
    t = 0  # step counter
    for epoch in range(1, num_epochs + 1):
        for index, (x, y_target) in enumerate(training_data):
            # Calculate the output of the model
            y_pred = x * params.w
            loss = (y_pred - y_target) ** 2
            # Calculate the gradients of the loss w.r.t. the parameters
            loss.backward()
            # Every time we reach the batch size or the end of the dataset, update the parameters
            if (index + 1) % batch_size == 0 or index == len(training_data) - 1:
                with torch.no_grad():
                    t += 1
                    # Compute biased first and second moment estimates
                    m_w = beta1 * m_w + (1 - beta1) * params.w.grad
                    v_w = beta2 * v_w + (1 - beta2) * (params.w.grad ** 2)
                    # Bias correction
                    m_w_hat = m_w / (1 - beta1 ** t)
                    v_w_hat = v_w / (1 - beta2 ** t)
                    # Update parameters with weight decay (AdamW)
                    # Equivalent to calling optimizer.step()
                    params.w -= learning_rate * (m_w_hat / (torch.sqrt(v_w_hat) + eps) + weight_decay * params.w)
                    # Reset the gradients to zero
                    # Equivalent to calling optimizer.zero_grad()
                    params.w.grad.zero_()
```
Split the batch of size $B$ across $M$ machines (each GPU processes $B/M$ samples). After computing gradients locally, synchronize across GPUs by exchanging gradients (a minimal sketch follows the performance notes below).
Performance
- Compute scaling: Each GPU gets $B/M$ examples, so computation divides evenly.
- Communication overhead: Every batch requires transmitting roughly $2 \times$ #params worth of data for gradient synchronization. This is acceptable if batches are large.
- Memory scaling: No memory savings — each GPU still needs to hold a full copy of the model parameters.
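A minimal sketch of one such data-parallel step (the `model`, `optimizer`, and `loss_fn` arguments are placeholders; a `torch.distributed` process group is assumed to be initialized already): each rank runs forward/backward on its $B/M$ slice, the gradients are averaged with an all-reduce, and every replica then applies the same optimizer update.

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, x_local, y_local, loss_fn):
    """One naive data-parallel step: local backward, then all-reduce of gradients."""
    optimizer.zero_grad()
    loss = loss_fn(model(x_local), y_local)   # x_local, y_local: this rank's B/M slice
    loss.backward()

    # Average gradients across ranks so every replica applies the same update.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world)

    optimizer.step()
    return loss.detach()
```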
Memory Breakdown
Depending on the precision used, the overhead looks like this:
- Model parameters: 2 bytes per parameter (FP16/BF16).
- the actual learnable weights θ of the neural network.
- In DDP, each GPU keeps its own replica of the parameters (so memory cost is multiplied across GPUs).
- Gradients: 2 bytes per parameter (FP16/BF16).
- During backpropagation, each GPU computes gradients $\nabla f(x_i)$ on its local mini-batch.
- Before `loss.backward()` returns, DDP performs an all-reduce to average these gradients across all GPUs.
- After synchronization, every GPU’s `param.grad` contains the same averaged value.
- Master weights (FP32): 4 bytes per parameter (used for the optimizer’s weight update).
- Even if we train in FP16/BF16 for speed, we cannot update weights directly in low precision (due to numerical instability). Therefore, we maintain a full FP32 copy of the model weights.
- In DDP, each GPU keeps its own master weights, but since gradients are synchronized, updates remain consistent across devices.
- Adam first moment estimate: 4 bytes (or 2 in BF16) per parameter.
- Each GPU stores its own copy, but they evolve identically since gradients are synchronized.
- Adam second moment estimate: 4 bytes (or 2 in BF16) per parameter.
Total Memory Cost
$$\text{Total memory} \approx \underbrace{2}_{\text{params}} + \underbrace{2}_{\text{grads}} + \underbrace{4}_{\text{FP32 master}} + \underbrace{4}_{m} + \underbrace{4}_{v} = 16 \text{ bytes per parameter}$$
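As a worked example (model size chosen only for illustration), a 7.5B-parameter model then needs about $7.5 \times 10^9 \times 16\,\text{B} \approx 120\,\text{GB}$ of parameter, gradient, and optimizer-state memory per GPU, before counting any activations.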
2. ZeRO (Zero Redundancy Optimizer) #
Idea: Shard optimizer states, gradients, parameters.
- Stage 1: Optimizer state sharding; comm cost 2 × #params
- Stage 2: Gradient sharding; comm cost 2 × #params
- Stage 3 (FSDP): Shard everything (params too); comm cost 3 × #params
Stage 1: Shard (partition) the optimizer states #
Core Idea of Stage 1
- Shard (partition) the optimizer states across GPUs.
- Keep full parameters + gradients on every GPU, but divide optimizer states evenly across all devices.
- Each GPU is only responsible for updating a subset of parameters corresponding to the optimizer state slice it owns.
Algorithm Flow
Forward + Backward (unchanged from DDP):
- Every GPU computes local forward + backward on its data.
- Gradients are all-reduced, so every GPU ends up with full averaged gradients for all parameters.
Optimizer State Partitioning (Reduce Scatter):
Instead of every GPU keeping all Adam states:
- First moment (m) and second moment (v) are partitioned across GPUs.
- Example: with 4 GPUs, each holds 25% of (m, v).
Update Rule (per GPU):
- Each GPU updates only the parameters for which it owns optimizer states: $$\theta_i \leftarrow \theta_i - \eta \cdot \frac{m_i}{\sqrt{v_i} + \epsilon}$$
- Here, $m_i, v_i$ are stored only on the GPU responsible for slice $i$.
Synchronization (All Gather):
- After updates, parameters are broadcast so that all GPUs have the same full model copy for the next forward pass.
Memory Consumption
$$2\Psi + 2\Psi + \frac{K \cdot \Psi}{N_d}$$
- Params and grads are still fully replicated on all GPUs
- Optimizer states are divided across $N_d$ devices
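A minimal sketch of this Stage 1 update, assuming a single flat FP32 parameter buffer (no autograd on it) replicated on every rank, gradients already averaged as in DDP, and a parameter count divisible by the world size; all names are illustrative.

```python
import torch
import torch.distributed as dist

def zero1_adam_step(flat_params, flat_grads, m_shard, v_shard, t, lr, beta1, beta2, eps):
    """ZeRO-1 sketch: full params/grads on every rank, optimizer state sharded."""
    rank, world = dist.get_rank(), dist.get_world_size()
    shard = flat_params.numel() // world
    lo, hi = rank * shard, (rank + 1) * shard

    # Each rank holds (m, v) only for its own slice and updates only that slice.
    g = flat_grads[lo:hi]
    m_shard.mul_(beta1).add_(g, alpha=1 - beta1)
    v_shard.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    m_hat = m_shard / (1 - beta1 ** t)
    v_hat = v_shard / (1 - beta2 ** t)
    flat_params[lo:hi] -= lr * m_hat / (v_hat.sqrt() + eps)

    # All-gather the updated slices so every rank again holds the full parameters.
    local = flat_params[lo:hi].clone()
    gathered = [torch.empty_like(local) for _ in range(world)]
    dist.all_gather(gathered, local)
    flat_params.copy_(torch.cat(gathered))
```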
Stage 2: Gradient Sharding #
Idea
- Extend Stage 1 by also sharding gradients across GPUs.
- Each GPU keeps only a slice of the gradients (in addition to optimizer state shard).
Complexity:
- A full gradient vector is never kept persistently in memory.
- But since training is data parallel, each worker must still compute full gradients locally during backpropagation; it just doesn’t keep them after the ReduceScatter.
- During the backward pass, temporary full-size gradients exist layer by layer, but each is freed immediately after its ReduceScatter, so they don’t count toward persistent memory usage.
Algorithm Flow
- Incremental Backward Pass
- Each GPU goes backward on its mini-batch.
- After computing a layer’s gradients:
- Step 1a: Immediately do a Reduce to send each gradient slice to the GPU responsible for it (assume each layer’s gradients map to one specific GPU).
- Step 1b: Once a gradient is reduced, free it from memory (since it’s no longer needed in the backward graph).
- Local Optimizer Update
- Each GPU updates its parameter shard using:
- Its local gradient shard
- Its local optimizer state shard $(m_i, v_i)$: $\theta_i \leftarrow \theta_i - \eta \cdot \frac{m_i}{\sqrt{v_i} + \epsilon}$
- AllGather Parameters
- After updates, GPUs must have the full parameter set for the next forward pass.
- Use AllGather to share updated parameter shards.
Memory Consumption
$$2\Psi + \frac{(2+K) \cdot \Psi}{N_d}$$
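A minimal sketch of the Stage 2 gradient path, assuming a flat local gradient tensor whose length divides evenly by the world size (the sharded Adam update itself is the same as in the Stage 1 sketch above).

```python
import torch
import torch.distributed as dist

def shard_gradients(flat_grads):
    """ZeRO-2 sketch: reduce-scatter the local full gradient, keep only this rank's shard."""
    world = dist.get_world_size()
    shard = flat_grads.numel() // world

    # Every rank contributes its full local gradient, but each rank only receives
    # (and keeps) the reduced shard it is responsible for updating.
    grad_shard = torch.empty(shard, dtype=flat_grads.dtype, device=flat_grads.device)
    dist.reduce_scatter(grad_shard, list(flat_grads.chunk(world)), op=dist.ReduceOp.SUM)
    grad_shard.div_(world)      # average over data-parallel ranks

    # The full-size gradient buffer can now be freed; only grad_shard persists.
    return grad_shard
```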
Stage 3: FSDP, shard everything #
High-Level Idea
- Extend Stage 1 (shard optimizer states) + Stage 2 (shard gradients) by also sharding the parameters.
- Every memory component (parameters, gradients, optimizer states) is partitioned across GPUs.
- Parameters are requested on demand and then freed immediately after use.
- This allows training models far beyond the memory capacity of a single GPU.
How It Works (Baby Version)
- Load + Gather Parameters
- Each GPU stores only a shard of the parameters.
- Before computing a forward pass on a layer:
- Use AllGather to collect the full parameters needed for that layer.
- Forward Computation
- Perform forward pass locally with the gathered full parameters.
- Once finished, free the parameters (keep only shard storage).
- Backward Computation
- During backward pass, compute full gradients for the layer.
- Use ReduceScatter to distribute gradient shards to the responsible GPUs.
- Free gradients once they are scattered.
- Optimizer Update
- Each GPU updates only its local parameter shard using its gradient shard + optimizer state shard.
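A toy sketch of this per-layer gather/compute/free pattern, assuming each rank holds a flat shard of one layer's weight, shapes divide evenly, and no autograd through the collectives; the function names are illustrative and not the FSDP API.

```python
import torch
import torch.distributed as dist

def fsdp_layer_forward(param_shard, x, out_dim, in_dim):
    """Gather the full weight for one layer, use it, then free it."""
    world = dist.get_world_size()

    # 1) AllGather the full flat weight just for this layer's computation.
    shards = [torch.empty_like(param_shard) for _ in range(world)]
    dist.all_gather(shards, param_shard)
    w = torch.cat(shards).view(out_dim, in_dim)

    # 2) Forward pass with the temporarily materialized weight.
    y = x @ w.t()

    # 3) Drop the full weight; only the shard stays resident on this rank.
    del w, shards
    return y

def fsdp_layer_grad_shard(full_grad):
    """ReduceScatter a layer's full gradient so each rank keeps only its shard."""
    world = dist.get_world_size()
    flat = full_grad.flatten()
    out = torch.empty(flat.numel() // world, dtype=flat.dtype, device=flat.device)
    dist.reduce_scatter(out, list(flat.chunk(world)), op=dist.ReduceOp.SUM)
    return out.div_(world)      # average over data-parallel ranks
```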
Overlapping Communication and Computation
- AllGather for layer (i+1) can happen in parallel with forward computation of layer (i).
- Similarly, ReduceScatter for layer (i) can overlap with backward computation of layer (i-1).
- This overlap masks communication cost, reducing overhead.
Communication Cost
- For each iteration:
- 2 × AllGather of #params (once in the forward pass, once in the backward pass).
- 1 × ReduceScatter of #params (for gradients).
- Total communication ≈ same order as DDP, but memory footprint is dramatically smaller.
Memory Consumption
$$\frac{2\Psi + (2+K) \cdot \Psi}{N_d}$$
Limits of Data Parallelism #
- Compute scaling limited by batch size.
- Models still may not fit (activation memory not reduced).
- Parameters (weights):
- Fixed tensors of the model (e.g. $W$ in $y = Wx + b$).
- They are the same across forward/backward and can be sharded with ZeRO.
- Activations:
- The intermediate outputs of each layer during the forward pass.
- Example for a 2-layer MLP: $h_1 = W_1 x, \quad h_2 = \text{ReLU}(h_1), \quad y = W_2 h_2$. Here $h_1, h_2$ are activations.
- To compute gradients during the backward pass, we need the activations. Example: for $y = W_2 h_2$, the gradient w.r.t. $W_2$ is $\nabla_{W_2} = \nabla_y \cdot h_2^{\top}$ (a tiny check follows below).
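A tiny illustrative check of this dependence (shapes chosen arbitrarily): autograd's gradient for $W_2$ matches the manual formula, which explicitly uses the saved activation $h_2$.

```python
import torch

# The gradient of W2 in y = W2 @ h2 explicitly uses the activation h2,
# which is why activations must stay in memory until the backward pass.
h2 = torch.randn(4)
W2 = torch.randn(3, 4, requires_grad=True)
loss = (W2 @ h2).sum()
loss.backward()

grad_y = torch.ones(3)              # d(loss)/dy for a sum() loss
manual = torch.outer(grad_y, h2)    # grad_W2 = grad_y · h2^T, needs h2
assert torch.allclose(W2.grad, manual)
```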
Beyond Data Parallel – Model Parallelism #
TBD