Parallelism Basics #
1. Basics #
1. Multi-GPU, Multi-Machine Parallelism #
The figure below shows a simplified overview of a training node with 8 GPUs, 2 CPUs, and connections to the InfiniBand network for multi-node communication.
For example, GPT-NeoX-20B was trained on a cluster of 12 servers, each equipped with 8 NVIDIA A100 GPUs and 2 CPUs.
1. Components
- CPUs (CPU₀, CPU₁)
- Two sockets per node (e.g., AMD EPYC).
- Provide PCIe lanes for GPUs and network adapters.
- Interconnected via xGMI-2 (x16): ~16 GT/s per lane → allows CPU₀ and CPU₁ to share data.
- PLX (PCIe Switches)
- PCI Express 4.0 switches that expand the number of lanes from each CPU.
- Each PLX connects CPU ↔ GPUs or CPU ↔ HCA.
- Bandwidth: 16 GT/s per lane × 16 lanes = ~32 GB/s per direction.
- GPUs (GPU₀ … GPU₇)
- Eight NVIDIA GPUs per node (e.g., A100).
- Each GPU connects upward to a PLX (PCIe) and downward to the NVSwitch fabric.
- GPUs are the primary compute devices for training.
- NVSwitch (NVSwitch₀ … NVSwitch₅)
- Dedicated crossbar switches for GPU-to-GPU communication.
- Each GPU connects to NVSwitch via NVLink 3.0 (2x links).
- Bandwidth per NVLink 3.0 link: ~50 GB/s; with 12 links per A100 this gives ~600 GB/s of aggregate NVLink bandwidth → very fast intra-node GPU communication.
- Ensures all 8 GPUs form a fully connected high-bandwidth network inside the node.
- HCA (Host Channel Adapter, HCA₀ … HCA₃)
- Network adapters (e.g., ConnectX-6) that connect the node to the InfiniBand fabric.
- Each HCA attaches to a CPU via PCIe 16x.
- Bandwidth per link: HDR InfiniBand ~50 Gb/s per lane × 4 lanes = 200 Gb/s (~25 GB/s).
- Supports GPUDirect RDMA → GPUs can directly send/receive data to remote GPUs without CPU involvement.
- InfiniBand Switches (External, not inside the node)
- Purple “Switch₀ / Switch₁” labels indicate connections to external InfiniBand switches.
- These external devices interconnect all nodes in the cluster into a global high-speed fabric.
- Typically deployed in redundant pairs (two switches) for load-balancing and reliability.
Hierarchy of Communication Speeds
- Fastest (inside node): GPU ↔ GPU via NVSwitch/NVLink (hundreds of GB/s, ~600 GB/s aggregate on A100).
- Slower (across nodes): GPU ↔ remote GPU via InfiniBand (~25 GB/s per HDR link).
Implications for Training
- Intra-node parallelism (tensor parallelism): relies on NVLink/NVSwitch for fast GPU-to-GPU synchronization.
- Inter-node parallelism (pipeline/data parallelism): relies on InfiniBand HCAs and switches, which are slower, so communication must be minimized.
- Design principle: keep heavy communication (activations, tensor splits) inside the node, and use lighter communication (gradients, parameters) across nodes.
2. Basics of Collective Communication #
Broadcast
- A single “root” rank provides input, and the value is copied to all other ranks.
- Cost: one device sends data of size N to all other P−1 devices; naive cost ~O(P·N), tree-based ~O(N·log P).
Reduce
- Each rank contributes its input; a reduction (e.g., sum) is computed; only the root rank receives the final result.
- Cost: all P devices each contribute data of size N, aggregated to one device; naive cost ~O(P·N), tree-based ~O(N·log P).
All Gather
- Each rank contributes a unique slice of data. At the end, every rank receives the concatenation of all slices.
- Use Case: Parameters or activations are sharded across GPUs, then reconstructed.
- Cost: each device holds a distinct slice of size N, and all P·N data must reach every device. Per-device cost: ~O((P−1)·N).
Reduce Scatter
- First, all ranks’ inputs are reduced (like All Reduce). Then, instead of every rank getting the full result, the reduced output is split into chunks and distributed.
- Each rank gets only its corresponding partition of the reduced result.
- Cost: all P devices start with a full vector of size N; after reduction each keeps only its 1/P partition. Per-device cost: ~((P−1)/P)·N ≈ N.
All Reduce
- Each rank starts with its own input (e.g., gradients). The system computes a reduction (e.g., sum) across all inputs and distributes the result back to all ranks.
- Cost: All Reduce = Reduce Scatter + All Gather, so the per-device cost is ~2·((P−1)/P)·N ≈ 2N (see the sketch below).
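To make these primitives concrete, here is a minimal sketch using `torch.distributed`. It assumes the script is launched with `torchrun` under the NCCL backend with one GPU per rank; the function name `demo_collectives` is illustrative, not part of any API.

```python
import torch
import torch.distributed as dist

def demo_collectives():
    # Assumes launch via `torchrun --nproc_per_node=<P> script.py` with NCCL.
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())
    device = torch.device("cuda")

    # Broadcast: rank 0's tensor is copied to every other rank.
    x = torch.full((4,), float(rank), device=device)
    dist.broadcast(x, src=0)                     # afterwards x == 0 on all ranks

    # All Reduce: every rank ends up with the sum over all ranks.
    g = torch.ones(4, device=device) * rank
    dist.all_reduce(g, op=dist.ReduceOp.SUM)     # g == 0 + 1 + ... + (world-1)

    # All Gather: concatenate each rank's unique slice on every rank.
    local_slice = torch.full((2,), float(rank), device=device)
    gathered = [torch.empty_like(local_slice) for _ in range(world)]
    dist.all_gather(gathered, local_slice)

    # Reduce Scatter: reduce across ranks, then each rank keeps only one chunk.
    inputs = [torch.ones(2, device=device) * rank for _ in range(world)]
    out = torch.empty(2, device=device)
    dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    demo_collectives()
```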
3. TPUs vs GPUs #
TPU Networking – Toroidal Mesh
- 2D toroidal mesh (like a grid where edges wrap around).
- Each chip is directly connected only to its neighbors (left/right, up/down, plus wrap-around).
- To communicate with a faraway chip, data must hop through multiple intermediate chips.
- Advantage: scales well to very large numbers of chips, since each chip only needs a few links.
- Well suited to collective communication such as all-reduce.
- Disadvantage: higher latency for communication between far-apart chips (multi-hop).
GPU Networking – All-to-All up to 256
- NVIDIA GPU clusters connect GPUs using switches (NVSwitch or InfiniBand).
- Topologies are designed for all-to-all connectivity: each GPU can (logically) communicate with any other GPU, usually in one hop.
- Up to 256 GPUs can be interconnected this way.
GPU SuperPODs: A100 vs H100
A100 SuperPOD (blue, InfiniBand): Each DGX node (8 GPUs) is connected internally with NVSwitch. For inter-node communication, GPUs rely on InfiniBand switches arranged in a spine–leaf architecture. At cluster scale (32 nodes, 256 GPUs), the bisection bandwidth is about 6,400 GB/s, which becomes a limiting factor for large-scale training.
H100 SuperPOD (green, NVLink Switch): Each DGX node is again internally connected with NVSwitch, but across nodes the GPUs now use dedicated NVLink Switches (NVS) instead of InfiniBand. This provides a massive jump in cluster-wide bandwidth: 57,600 GB/s at 256 GPUs. Cross-node communication is much closer to intra-node NVLink speeds, resulting in far better scaling efficiency.
Summary
- With A100, once you scale to 256 GPUs, InfiniBand bandwidth becomes the bottleneck.
- With H100, the new NVLink Switch fabric keeps cross-node communication much faster, so scaling efficiency remains high.
- TPU mesh is different: it scales to thousands of chips but each communication may take multiple hops. GPUs instead aim for high-bandwidth all-to-all within a bounded scale (like 256).
2. Data Parallelism, ZeRO #
1. Naïve Data Parallelism #
We begin with the standard Adam optimizer update rule:
$g_t = \nabla_\theta f(\theta_t)$
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
$\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
```python
import torch

def train_accumulate(params: ModelParameters, num_epochs, learning_rate, batch_size,
                     beta1, beta2, eps, weight_decay):
    # Initialize moment estimates
    m_w = torch.zeros_like(params.w)
    v_w = torch.zeros_like(params.w)
    t = 0  # step counter
    for epoch in range(1, num_epochs + 1):
        for index, (x, y_target) in enumerate(training_data):
            # Calculate the output of the model
            y_pred = x * params.w
            loss = (y_pred - y_target) ** 2
            # Calculate the gradients of the loss w.r.t. the parameters
            loss.backward()
            # Every time we reach the batch size or the end of the dataset, update the parameters
            if (index + 1) % batch_size == 0 or index == len(training_data) - 1:
                with torch.no_grad():
                    t += 1
                    # Compute biased first and second moment estimates
                    m_w = beta1 * m_w + (1 - beta1) * params.w.grad
                    v_w = beta2 * v_w + (1 - beta2) * (params.w.grad ** 2)
                    # Bias correction
                    m_w_hat = m_w / (1 - beta1 ** t)
                    v_w_hat = v_w / (1 - beta2 ** t)
                    # Update parameters with weight decay (AdamW)
                    # Equivalent to calling optimizer.step()
                    params.w -= learning_rate * (m_w_hat / (torch.sqrt(v_w_hat) + eps) + weight_decay * params.w)
                    # Reset the gradients to zero
                    # Equivalent to calling optimizer.zero_grad()
                    params.w.grad.zero_()
```
Split the batch of size $B$ across $M$ machines (each GPU processes $B/M$ samples). After computing gradients locally, synchronize across GPUs by exchanging gradients (a minimal sketch follows the performance notes below).
Performance
- Compute scaling: Each GPU gets $B/M$ examples, so computation divides evenly.
- Communication overhead: Every batch requires transmitting roughly $2 \times$ #params worth of data for gradient synchronization. This is acceptable if batches are large.
- Memory scaling: No memory savings — each GPU still needs to hold a full copy of the model parameters.
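A minimal sketch of one such data-parallel step (the `model`, `optimizer`, and `loss_fn` arguments are placeholders; a `torch.distributed` process group is assumed to be initialized already): each rank runs forward/backward on its $B/M$ slice, the gradients are averaged with an all-reduce, and every replica then applies the same optimizer update.

```python
import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, x_local, y_local, loss_fn):
    """One naive data-parallel step: local backward, then all-reduce of gradients."""
    optimizer.zero_grad()
    loss = loss_fn(model(x_local), y_local)   # x_local, y_local: this rank's B/M slice
    loss.backward()

    # Average gradients across ranks so every replica applies the same update.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world)

    optimizer.step()
    return loss.detach()
```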
Memory Breakdown
Depending on the precision used, the overhead looks like this:
- Model parameters: 2 bytes per parameter (FP16/BF16).
- the actual learnable weights θ of the neural network.
- In DDP, each GPU keeps its own replica of the parameters (so memory cost is multiplied across GPUs).
- Gradients: 2 bytes per parameter (FP16/BF16).
- During backpropagation, each GPU computes gradients $\nabla f(x_i)$ on its local mini-batch.
- Before `loss.backward()` returns, DDP performs an all-reduce to average these gradients across all GPUs.
- After synchronization, every GPU’s `param.grad` contains the same averaged value.
- Master weights (FP32): 4 bytes per parameter (used for the optimizer’s weight update).
- Even if we train in FP16/BF16 for speed, we cannot update weights directly in low precision (due to numerical instability). Therefore, we maintain a full FP32 copy of the model weights.
- In DDP, each GPU keeps its own master weights, but since gradients are synchronized, updates remain consistent across devices.
- Adam first moment estimate: 4 bytes (or 2 in BF16) per parameter.
- Each GPU stores its own copy, but they evolve identically since gradients are synchronized.
- Adam second moment estimate: 4 bytes (or 2 in BF16) per parameter.
Total Memory Cost
$$\text{Total memory} \approx \underbrace{2}_{\text{params}} + \underbrace{2}_{\text{grads}} + \underbrace{4}_{\text{FP32 master}} + \underbrace{4}_{m} + \underbrace{4}_{v} = 16 \text{ bytes per parameter}$$
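As a worked example (model size chosen only for illustration), a 7.5B-parameter model then needs about $7.5 \times 10^9 \times 16\,\text{B} \approx 120\,\text{GB}$ of parameter, gradient, and optimizer-state memory per GPU, before counting any activations.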
2. ZeRO (Zero Redundancy Optimizer) #
Idea: Shard optimizer states, gradients, parameters.
- Stage 1: Optimizer state sharding; comm cost 2 × #params
- Stage 2: Gradient sharding; comm cost 2 × #params
- Stage 3 (FSDP): Shard everything (params too); comm cost 3 × #params
Stage 1: Shard (partition) the optimizer states #
Core Idea of Stage 1
- Shard (partition) the optimizer states across GPUs.
- Keep full parameters + gradients on every GPU, but divide optimizer states evenly across all devices.
- Each GPU is only responsible for updating a subset of parameters corresponding to the optimizer state slice it owns.
Algorithm Flow
Forward + Backward (unchanged from DDP):
- Every GPU computes local forward + backward on its data.
- Gradients are all-reduced, so every GPU ends up with full averaged gradients for all parameters.
Optimizer State Partitioning (Reduce Scatter):
Instead of every GPU keeping all Adam states:
- First moment (m) and second moment (v) are partitioned across GPUs.
- Example: with 4 GPUs, each holds 25% of (m, v).
Update Rule (per GPU):
- Each GPU updates only the parameters for which it owns optimizer states: $$\theta_i \leftarrow \theta_i - \eta \cdot \frac{m_i}{\sqrt{v_i} + \epsilon}$$
- Here, $m_i, v_i$ are stored only on the GPU responsible for slice $i$.
Synchronization (All Gather):
- After updates, parameters are broadcast so that all GPUs have the same full model copy for the next forward pass.
Memory Consumption
$$2\Psi + 2\Psi + \frac{K \cdot \Psi}{N_d}$$
- Params and grads are still fully replicated on all GPUs
- Optimizer states are divided across $N_d$ devices
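A minimal sketch of this Stage 1 update, assuming a single flat FP32 parameter buffer (no autograd on it) replicated on every rank, gradients already averaged as in DDP, and a parameter count divisible by the world size; all names are illustrative.

```python
import torch
import torch.distributed as dist

def zero1_adam_step(flat_params, flat_grads, m_shard, v_shard, t, lr, beta1, beta2, eps):
    """ZeRO-1 sketch: full params/grads on every rank, optimizer state sharded."""
    rank, world = dist.get_rank(), dist.get_world_size()
    shard = flat_params.numel() // world
    lo, hi = rank * shard, (rank + 1) * shard

    # Each rank holds (m, v) only for its own slice and updates only that slice.
    g = flat_grads[lo:hi]
    m_shard.mul_(beta1).add_(g, alpha=1 - beta1)
    v_shard.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    m_hat = m_shard / (1 - beta1 ** t)
    v_hat = v_shard / (1 - beta2 ** t)
    flat_params[lo:hi] -= lr * m_hat / (v_hat.sqrt() + eps)

    # All-gather the updated slices so every rank again holds the full parameters.
    local = flat_params[lo:hi].clone()
    gathered = [torch.empty_like(local) for _ in range(world)]
    dist.all_gather(gathered, local)
    flat_params.copy_(torch.cat(gathered))
```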
Stage 2: Gradient Sharding #
Idea
- Extend Stage 1 by also sharding gradients across GPUs.
- Each GPU keeps only a slice of the gradients (in addition to optimizer state shard).
Complexity:
- A full gradient vector is never kept persistently in memory.
- But since training is data parallel, each worker must still compute full gradients locally during backpropagation; it just doesn’t keep them after the ReduceScatter.
- During the backward pass, temporary full-size gradients exist layer by layer, but each is freed immediately after its ReduceScatter, so they don’t count toward persistent memory usage.
Algorithm Flow
- Incremental Backward Pass
- Each GPU goes backward on its mini-batch.
- After computing a layer’s gradients:
- Step 1a: Immediately do a Reduce to send each gradient slice to the GPU responsible for it (assume each layer’s gradients map to one specific GPU).
- Step 1b: Once a gradient is reduced, free it from memory (since it’s no longer needed in the backward graph).
- Local Optimizer Update
- Each GPU updates its parameter shard using:
- Its local gradient shard
- Its local optimizer state shard $(m_i, v_i)$: $\theta_i \leftarrow \theta_i - \eta \cdot \frac{m_i}{\sqrt{v_i} + \epsilon}$
- AllGather Parameters
- After updates, GPUs must have the full parameter set for the next forward pass.
- Use AllGather to share updated parameter shards.
Memory Consumption
$$2\Psi + \frac{(2+K) \cdot \Psi}{N_d}$$
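A minimal sketch of the Stage 2 gradient path, assuming a flat local gradient tensor whose length divides evenly by the world size (the sharded Adam update itself is the same as in the Stage 1 sketch above).

```python
import torch
import torch.distributed as dist

def shard_gradients(flat_grads):
    """ZeRO-2 sketch: reduce-scatter the local full gradient, keep only this rank's shard."""
    world = dist.get_world_size()
    shard = flat_grads.numel() // world

    # Every rank contributes its full local gradient, but each rank only receives
    # (and keeps) the reduced shard it is responsible for updating.
    grad_shard = torch.empty(shard, dtype=flat_grads.dtype, device=flat_grads.device)
    dist.reduce_scatter(grad_shard, list(flat_grads.chunk(world)), op=dist.ReduceOp.SUM)
    grad_shard.div_(world)      # average over data-parallel ranks

    # The full-size gradient buffer can now be freed; only grad_shard persists.
    return grad_shard
```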
Stage 3: FSDP, shard everything #
High-Level Idea
- Extend Stage 1 (shard optimizer states) + Stage 2 (shard gradients) by also sharding the parameters.
- Every memory component (parameters, gradients, optimizer states) is partitioned across GPUs.
- Parameters are requested on demand and then freed immediately after use.
- This allows training models far beyond the memory capacity of a single GPU.
How It Works (Baby Version)
- Load + Gather Parameters
- Each GPU stores only a shard of the parameters.
- Before computing a forward pass on a layer:
- Use AllGather to collect the full parameters needed for that layer.
- Forward Computation
- Perform forward pass locally with the gathered full parameters.
- Once finished, free the parameters (keep only shard storage).
- Backward Computation
- During backward pass, compute full gradients for the layer.
- Use ReduceScatter to distribute gradient shards to the responsible GPUs.
- Free gradients once they are scattered.
- Optimizer Update
- Each GPU updates only its local parameter shard using its gradient shard + optimizer state shard.
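A toy sketch of this per-layer gather/compute/free pattern, assuming each rank holds a flat shard of one layer's weight, shapes divide evenly, and no autograd through the collectives; the function names are illustrative and not the FSDP API.

```python
import torch
import torch.distributed as dist

def fsdp_layer_forward(param_shard, x, out_dim, in_dim):
    """Gather the full weight for one layer, use it, then free it."""
    world = dist.get_world_size()

    # 1) AllGather the full flat weight just for this layer's computation.
    shards = [torch.empty_like(param_shard) for _ in range(world)]
    dist.all_gather(shards, param_shard)
    w = torch.cat(shards).view(out_dim, in_dim)

    # 2) Forward pass with the temporarily materialized weight.
    y = x @ w.t()

    # 3) Drop the full weight; only the shard stays resident on this rank.
    del w, shards
    return y

def fsdp_layer_grad_shard(full_grad):
    """ReduceScatter a layer's full gradient so each rank keeps only its shard."""
    world = dist.get_world_size()
    flat = full_grad.flatten()
    out = torch.empty(flat.numel() // world, dtype=flat.dtype, device=flat.device)
    dist.reduce_scatter(out, list(flat.chunk(world)), op=dist.ReduceOp.SUM)
    return out.div_(world)      # average over data-parallel ranks
```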
Overlapping Communication and Computation
- AllGather for layer (i+1) can happen in parallel with forward computation of layer (i).
- Similarly, ReduceScatter for layer (i) can overlap with backward computation of layer (i-1).
- This overlap masks communication cost, reducing overhead.
Communication Cost
- For each iteration:
- 2 × AllGather of #params (once in the forward pass, once in the backward pass).
- 1 × ReduceScatter of #params (for gradients).
- Total communication ≈ same order as DDP, but memory footprint is dramatically smaller.
Memory Consumption
$$\frac{2\Psi + (2+K) \cdot \Psi}{N_d}$$
Limits of Data Parallelism #
- Compute scaling limited by batch size.
- Models still may not fit (activation memory not reduced).
- Parameters (weights):
- Fixed tensors of the model (e.g. $W$ in $y = Wx + b$).
- They are the same across forward/backward and can be sharded with ZeRO.
- Activations:
- The intermediate outputs of each layer during the forward pass.
- Example for a 2-layer MLP: $h_1 = W_1 x, \quad h_2 = \text{ReLU}(h_1), \quad y = W_2 h_2$. Here $h_1, h_2$ are activations.
- To compute gradients during the backward pass, we need the activations. Example: for $y = W_2 h_2$, the gradient w.r.t. $W_2$ is $\nabla_{W_2} = \nabla_y \cdot h_2^{\top}$ (a tiny check follows below).
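A tiny illustrative check of this dependence (shapes chosen arbitrarily): autograd's gradient for $W_2$ matches the manual formula, which explicitly uses the saved activation $h_2$.

```python
import torch

# The gradient of W2 in y = W2 @ h2 explicitly uses the activation h2,
# which is why activations must stay in memory until the backward pass.
h2 = torch.randn(4)
W2 = torch.randn(3, 4, requires_grad=True)
loss = (W2 @ h2).sum()
loss.backward()

grad_y = torch.ones(3)              # d(loss)/dy for a sum() loss
manual = torch.outer(grad_y, h2)    # grad_W2 = grad_y · h2^T, needs h2
assert torch.allclose(W2.grad, manual)
```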
Beyond Data Parallel – Model Parallelism #
TBD