
Parallelism Basics #

1. Basics #

1. Multi-GPU, Multi-Machine Parallelism #

The figure below shows a simplified overview of a training node with 8 GPUs, 2 CPUs, and connections to the InfiniBand network for multi-node communication. For example, GPT-NeoX-20B was trained on a cluster of 12 servers, each equipped with 8 NVIDIA A100 GPUs and 2 CPUs.

(Figure: node architecture diagram)

1. Components
  1. CPUs (CPU₀, CPU₁)
    • Two sockets per node (e.g., AMD EPYC).
    • Provide PCIe lanes for GPUs and network adapters.
    • Interconnected via xGMI-2 (16x): ~16 GT/s per lane → allows CPU₀ and CPU₁ to share data.
  2. PLX (PCIe Switches)
    • PCI Express 4.0 switches that expand the number of lanes from each CPU.
    • Each PLX connects CPU ↔ GPUs or CPU ↔ HCA.
    • Bandwidth: 16 GT/s per lane × 16 lanes = ~32 GB/s per direction.
  3. GPUs (GPU₀ … GPU₇)
    • Eight NVIDIA GPUs per node (e.g., A100).
    • Each GPU connects upward to a PLX (PCIe) and downward to the NVSwitch fabric.
    • GPUs are the primary compute devices for training.
  4. NVSwitch (NVSwitch₀ … NVSwitch₅)
    • Dedicated crossbar switches for GPU-to-GPU communication.
    • Each GPU connects to NVSwitch via NVLink 3.0 (2x links).
    • Aggregate NVLink 3.0 bandwidth per GPU: ~600 GB/s on an A100 (12 links × 50 GB/s) → very fast intra-node GPU communication.
    • Ensures all 8 GPUs form a fully connected high-bandwidth network inside the node.
  5. HCA (Host Channel Adapter, HCA₀ … HCA₃)
    • Network adapters (e.g., ConnectX-6) that connect the node to the InfiniBand fabric.
    • Each HCA attaches to a CPU via PCIe 16x.
    • Bandwidth per link: HDR InfiniBand ~50 Gb/s per lane × 4 lanes = 200 Gb/s (~25 GB/s).
    • Supports GPUDirect RDMA → GPUs can directly send/receive data to remote GPUs without CPU involvement.
  6. InfiniBand Switches (External, not inside the node)
    • Purple “Switch₀ / Switch₁” labels indicate connections to external InfiniBand switches.
    • These external devices interconnect all nodes in the cluster into a global high-speed fabric.
    • Typically deployed in redundant pairs (two switches) for load-balancing and reliability.
2. Hierarchy of Communication Speeds

    • Fastest (inside node): GPU ↔ GPU via NVSwitch/NVLink (~400 GB/s).
    • Slower (across nodes): GPU ↔ remote GPU via InfiniBand (~50 GB/s).
3. Implications for Training

    • Intra-node parallelism (tensor parallelism): relies on NVLink/NVSwitch for fast GPU-to-GPU synchronization.
    • Inter-node parallelism (pipeline/data parallelism): relies on InfiniBand HCAs and switches, which are slower, so communication must be minimized.
    • Design principle: keep heavy communication (activations, tensor splits) inside the node, and use lighter communication (gradients, parameters) across nodes; see the sketch below.
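
One common way to apply this principle is to build separate communicator groups so that tensor-parallel collectives stay inside a node (NVLink/NVSwitch) while data-parallel collectives cross nodes (InfiniBand). A minimal sketch using `torch.distributed`; the 8-GPUs-per-node layout and the group construction here are illustrative assumptions, not a fixed recipe:

import torch.distributed as dist

def build_groups(gpus_per_node: int = 8):
    """Create intra-node (tensor-parallel) and inter-node (data-parallel) groups."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    num_nodes = world_size // gpus_per_node

    tp_group, dp_group = None, None

    # One tensor-parallel group per node: ranks that share NVLink/NVSwitch.
    for node in range(num_nodes):
        ranks = list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
        group = dist.new_group(ranks)  # every rank must call new_group in the same order
        if rank in ranks:
            tp_group = group

    # One data-parallel group per local GPU index: ranks that talk over InfiniBand.
    for local in range(gpus_per_node):
        ranks = list(range(local, world_size, gpus_per_node))
        group = dist.new_group(ranks)
        if rank in ranks:
            dp_group = group

    return tp_group, dp_group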

2. Basics of Collective Communication #

(Figure: collective communication operations)

  1. Broadcast

    • A single “root” rank provides input, and the value is copied to all other ranks.
    • Cost: One device sends data of size N to all other P-1 devices, naive cost ~ O(P·N). Tree: ~O(log(P)N)
  2. Reduce

    • Each rank contributes its input; a reduction (e.g., sum) is computed; only the root rank receives the final result.
    • Cost: All P devices each have data of size N, aggregate to one device, naive cost ~ O(PN). Tree: ~O(log(P)N)
  3. All Gather

    • Each rank contributes a unique slice of data. At the end, every rank receives the concatenation of all slices.
    • Use Case: Parameters or activations are sharded across GPUs, then reconstructed.
    • Cost: Each device has different N data, collect all P·N data to every device. Per-device cost: O((P-1)·N)
  4. Reduce Scatter

    • First, all ranks’ inputs are reduced (like All Reduce). Then, instead of every rank getting the full result, the reduced output is split into chunks and distributed.
    • Each rank gets only its corresponding partition of the reduced result.
    • Cost: All P devices have N data, reduce and scatter result: Per-device cost: (P-1)/P·N
  5. All Reduce

    • Each rank starts with its own input (e.g., gradients). The system computes a reduction (e.g., sum) across all inputs and distributes the result back to all ranks.
    • Cost: All Reduce = Reduce Scatter + All Gather; naïve per-device cost ~ O(2·(P-1)/P·N) ≈ O(2N)
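
These primitives map directly onto `torch.distributed`. A minimal sketch, assuming a process group has already been initialized (e.g. `dist.init_process_group("nccl")` launched via `torchrun`) with one GPU per process:

import torch
import torch.distributed as dist

def collectives_demo():
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())

    # Broadcast: rank 0's tensor is copied to every rank.
    x = torch.full((4,), float(rank), device=device)
    dist.broadcast(x, src=0)                      # every rank now holds rank 0's values

    # Reduce: element-wise sum across ranks, result only on the root rank.
    r = torch.ones(4, device=device) * rank
    dist.reduce(r, dst=0, op=dist.ReduceOp.SUM)   # only rank 0 holds the sum

    # All-Reduce: element-wise sum across ranks, result on every rank.
    g = torch.ones(4, device=device) * rank
    dist.all_reduce(g, op=dist.ReduceOp.SUM)

    # All-Gather: concatenate one shard per rank on every rank.
    shard = torch.ones(4, device=device) * rank
    gathered = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(gathered, shard)

    # Reduce-Scatter: sum across ranks, then each rank keeps one shard of the result.
    inputs = [torch.ones(4, device=device) * rank for _ in range(world_size)]
    out = torch.empty(4, device=device)
    dist.reduce_scatter(out, inputs, op=dist.ReduceOp.SUM)
    return x, r, g, gathered, out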

3. TPUs vs GPUs #

  1. TPU Networking – Toroidal Mesh

    • 2D toroidal mesh (like a grid where edges wrap around).
    • Each chip is directly connected only to its neighbors (left/right, up/down, plus wrap-around).
    • To communicate with a faraway chip, data must hop through multiple intermediate chips.
    • Advantage: scales well to very large numbers of chips, since each chip only needs a few links.
      • great for collective communication like all reduce
    • Disadvantage: higher latency for communication between distant chips (multi-hop).
  2. GPU Networking – All-to-All up to 256

    • NVIDIA GPU clusters connect GPUs using switches (NVSwitch or InfiniBand).
    • Topologies are designed for all-to-all connectivity: each GPU can (logically) communicate with any other GPU, usually in one hop.
    • Up to 256 GPUs can be interconnected this way.
  3. GPU SuperPODs: A100 vs H100

    • A100 SuperPOD (blue, InfiniBand): Each DGX node (8 GPUs) is connected internally with NVSwitch. For inter-node communication, GPUs rely on InfiniBand switches arranged in a spine–leaf architecture. At cluster scale (32 nodes, 256 GPUs), the bisection bandwidth is about 6,400 GB/s, which becomes a limiting factor for large-scale training.

    • H100 SuperPOD (green, NVLink Switch): Each DGX node is again internally connected with NVSwitch, but across nodes the GPUs now use dedicated NVLink Switches (NVS) instead of InfiniBand. This provides a massive jump in cluster-wide bandwidth: 57,600 GB/s at 256 GPUs. Cross-node communication is much closer to intra-node NVLink speeds, resulting in far better scaling efficiency.

  4. Summary

    • With A100, once you scale to 256 GPUs, InfiniBand bandwidth becomes the bottleneck.
    • With H100, the new NVLink Switch fabric keeps cross-node communication much faster, so scaling efficiency remains high.
    • TPU mesh is different: it scales to thousands of chips but each communication may take multiple hops. GPUs instead aim for high-bandwidth all-to-all within a bounded scale (like 256).

2. Data Parallelism, ZeRO #

1. Naïve Data Parallelism #

We begin with the standard Adam optimizer update rule:

$g_t = \nabla_\theta f(\theta_t)$

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$

$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$

$\theta_{t+1} = \theta_t - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$

import torch

# `ModelParameters` is assumed to expose a single learnable weight tensor `w`
# (with requires_grad=True); `training_data` is assumed to be an iterable of
# (x, y_target) pairs defined elsewhere.
def train_accumulate(params: ModelParameters, num_epochs, learning_rate, batch_size,
                     beta1, beta2, eps, weight_decay):

    # Initialize Adam moment estimates
    m_w = torch.zeros_like(params.w)
    v_w = torch.zeros_like(params.w)
    t = 0  # optimizer step counter

    for epoch in range(1, num_epochs+1):
        for index, (x, y_target) in enumerate(training_data):
            # Calculate the output of the model
            y_pred = x * params.w
            loss = (y_pred - y_target) ** 2

            # Calculate the gradients of the loss w.r.t. the parameters
            loss.backward()

            # Every time we reach the batch size or the end of the dataset, update the parameters
            if (index + 1) % batch_size == 0 or index == len(training_data) - 1:
                with torch.no_grad():
                    t += 1
                    # Compute biased first and second moment estimates
                    m_w = beta1 * m_w + (1 - beta1) * params.w.grad
                    v_w = beta2 * v_w + (1 - beta2) * (params.w.grad ** 2)

                    # Bias correction
                    m_w_hat = m_w / (1 - beta1 ** t)
                    v_w_hat = v_w / (1 - beta2 ** t)

                    # Update parameters with weight decay (AdamW)
                    # Equivalent to calling optimizer.step()
                    params.w -= learning_rate * (m_w_hat / (torch.sqrt(v_w_hat) + eps) + weight_decay * params.w)

                    # Reset the gradients to zero
                    # Equivalent to calling optimizer.zero_grad()
                    params.w.grad.zero_()

Split the batch of size $B$ across $M$ machines (each GPU processes $B/M$ samples). After computing gradients locally, synchronize across GPUs by exchanging gradients, as in the sketch below.
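
In code, the synchronization step is just an all-reduce over the local gradients before the optimizer step. A minimal hand-rolled sketch (assuming an initialized `torch.distributed` process group; `model`, `optimizer`, and `loss_fn` are placeholders; in practice `torch.nn.parallel.DistributedDataParallel` does this with bucketed all-reduces overlapped with the backward pass):

import torch
import torch.distributed as dist

def data_parallel_step(model, optimizer, x, y, loss_fn):
    """One step of naive data parallelism: this rank sees B/M samples (x, y)."""
    world_size = dist.get_world_size()

    # Local forward + backward on this rank's shard of the global batch.
    loss = loss_fn(model(x), y)
    loss.backward()

    # Synchronize: average gradients across ranks so every replica applies
    # the identical parameter update.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()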

  1. Performance

    • Compute scaling: Each GPU gets $B/M$ examples, so computation divides evenly.
    • Communication overhead: Every batch requires communicating $\sim 2 \times \#\text{params}$ values per GPU for gradient synchronization (the all-reduce cost from above). This is acceptable if batches are large.
    • Memory scaling: No memory savings — each GPU still needs to hold a full copy of the model parameters.
  2. Memory Breakdown

    Depending on the precision used, the overhead looks like this:

    • Model parameters: 2 bytes per parameter (FP16/BF16).
      • the actual learnable weights θ of the neural network.
      • In DDP, each GPU keeps its own replica of the parameters (so memory cost is multiplied across GPUs).
    • Gradients: 2 bytes per parameter (FP16/BF16).
      • During backpropagation, each GPU computes gradients $\nabla f(x_i)$ on its local mini-batch.
      • Before loss.backward() returns, DDP performs an all-reduce to average these gradients across all GPUs.
      • After synchronization, every GPU’s param.grad contains the same averaged value.
    • Master weights (FP32): 4 bytes per parameter (used for the optimizer update).
      • Even if we train in FP16/BF16 for speed, we cannot update weights directly in low precision (due to numerical instability). Therefore, we maintain a full FP32 copy of the model weights.
      • In DDP, each GPU keeps its own master weights, but since gradients are synchronized, updates remain consistent across devices.
    • Adam first moment estimate: 4 bytes (or 2 in BF16) per parameter.
      • Each GPU stores its own copy, but they evolve identically since gradients are synchronized.
    • Adam second moment estimate: 4 bytes (or 2 in BF16) per parameter.
  3. Total Memory Cost

    $$\text{Total memory} \approx (2 + 2 + 4 + 4 + 4)\ \text{bytes} \times \#\text{params} = 16 \text{ bytes per parameter}$$
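
    As a concrete check: a 7B-parameter model needs roughly $16 \text{ bytes} \times 7 \times 10^9 \approx 112$ GB of persistent training state per GPU before counting activations, which already exceeds the 80 GB of a single A100. This fully replicated state is exactly what ZeRO removes.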

2. ZeRO (Zero Redundancy Optimizer) #

Idea: shard the optimizer states, gradients, and parameters across the data-parallel workers.

(Figure: ZeRO memory partitioning across stages)

  • Stage 1: Optimizer state sharding, comm cost 2 × #params
  • Stage 2: Gradient sharding, comm cost 2 × #params
  • Stage 3 (FSDP): Shard everything (params too), comm cost 3 × #params

Stage 1: Shard (partition) the optimizer states #

Core Idea of Stage 1

  • Shard (partition) the optimizer states across GPUs.
  • Keep full parameters + gradients on every GPU, but divide optimizer states evenly across all devices.
  • Each GPU is only responsible for updating a subset of parameters corresponding to the optimizer state slice it owns.

Algorithm Flow

  1. Forward + Backward (unchanged from DDP):

    • Every GPU computes local forward + backward on its data.
    • Gradients are all-reduced, so every GPU ends up with full averaged gradients for all parameters.
  2. Optimizer State Partitioning (Reduce Scatter):

    • Instead of every GPU keeping all Adam states:

      • First moment (m) and second moment (v) are partitioned across GPUs.
      • Example: with 4 GPUs, each holds 25% of (m, v).
  3. Update Rule (per GPU):

    • Each GPU updates only the parameters for which it owns optimizer states: $$\theta_i \leftarrow \theta_i - \eta \cdot \frac{m_i}{\sqrt{v_i} + \epsilon}$$
    • Here, $m_i, v_i$ are stored only on the GPU responsible for slice $i$.
  4. Synchronization (All Gather):

    • After the update, the parameter shards are all-gathered so that all GPUs have the same full model copy for the next forward pass.

Memory Consumption

$$2\Psi + 2\Psi + \frac{K \cdot \Psi}{N_d}$$
  • $\Psi$ = number of parameters, $N_d$ = number of data-parallel devices, $K$ = bytes of optimizer state per parameter ($K = 12$ for Adam: FP32 master weights, first moment, second moment)
  • Params and grads are still fully replicated on all GPUs
  • Optimizer states are divided across the $N_d$ devices
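
A minimal sketch of the Stage 1 update, assuming the parameters and gradients live in flat 1-D tensors whose length divides evenly by the world size; all names and the layout are illustrative, not the DeepSpeed API:

import torch
import torch.distributed as dist

def zero1_step(flat_params, flat_grads, m_shard, v_shard, lr, beta1, beta2, eps, t):
    """ZeRO-1: params/grads are full everywhere; Adam state (m, v) is sharded."""
    rank, world = dist.get_rank(), dist.get_world_size()
    shard_size = flat_params.numel() // world       # assume it divides evenly
    lo, hi = rank * shard_size, (rank + 1) * shard_size

    # flat_grads was already all-reduced (averaged), exactly as in DDP.
    g = flat_grads[lo:hi]

    # Adam update, but only for the slice whose optimizer state this rank owns.
    m_shard.mul_(beta1).add_(g, alpha=1 - beta1)
    v_shard.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    m_hat = m_shard / (1 - beta1 ** t)
    v_hat = v_shard / (1 - beta2 ** t)
    flat_params[lo:hi] -= lr * m_hat / (v_hat.sqrt() + eps)

    # All-gather the updated slices so every rank again holds the full parameters.
    updated_slice = flat_params[lo:hi].clone()
    dist.all_gather_into_tensor(flat_params, updated_slice)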

Stage 2: Gradient Sharding #

Idea

  • Extend Stage 1 by also sharding gradients across GPUs.
  • Each GPU keeps only a slice of the gradients (in addition to optimizer state shard).

Memory subtlety:

  • A full gradient vector is never kept resident in memory.
  • Since training is still data parallel, each worker must compute gradients for all parameters during backpropagation; it simply does not retain them after the ReduceScatter.
  • Gradient buffers produced during the backward pass are freed immediately after they are reduce-scattered, so they do not count toward persistent memory usage.

Algorithm Flow

  1. Incremental Backward Pass
    • Each GPU runs the backward pass on its own mini-batch.
    • After a layer's gradients are computed:
    • Step 1a: Immediately Reduce the gradient so that each slice lands on the GPU that owns it (for simplicity, suppose each layer's slice is owned by one specific GPU).
    • Step 1b: Once a gradient has been reduced, free it from memory (it is no longer needed in the backward graph).
  2. Local Optimizer Update
    • Each GPU updates its parameter shard using:
    • its local gradient shard, and
    • its local optimizer state shard $(m_i, v_i)$:
    $$\theta_i \leftarrow \theta_i - \eta \cdot \frac{m_i}{\sqrt{v_i} + \epsilon}$$
  3. AllGather Parameters
    • After updates, GPUs must have the full parameter set for the next forward pass.
    • Use AllGather to share updated parameter shards.
  4. Memory Consumption
    $$2\Psi + \frac{(2+K) \cdot \Psi}{N_d}$$
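
A minimal sketch of the gradient handling only, again over a flat buffer and with illustrative names (real implementations hook this into the backward pass bucket by bucket, so the full-size buffers are short-lived):

import torch
import torch.distributed as dist

def zero2_grad_sync(flat_grads):
    """ZeRO-2: reduce-scatter gradients so each rank keeps only its own shard.

    flat_grads: full-size local gradient produced by this rank's backward pass.
    Returns the averaged gradient shard this rank owns; the caller can then free
    flat_grads, keeping only 1/N_d of the gradient memory resident.
    """
    world = dist.get_world_size()
    shard_size = flat_grads.numel() // world        # assume it divides evenly
    grad_shard = torch.empty(shard_size, device=flat_grads.device, dtype=flat_grads.dtype)

    # Sum the per-rank gradients and hand each rank its own slice of the result.
    dist.reduce_scatter_tensor(grad_shard, flat_grads, op=dist.ReduceOp.SUM)
    grad_shard /= world                             # average, matching DDP semantics

    return grad_shard  # feed into the rank-local Adam update from Stage 1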

Stage 3: FSDP, shard everything #

FSDP High-Level Idea

  • Extend Stage 1 (shard optimizer states) and Stage 2 (shard gradients) by also sharding the parameters.
  • Every memory component (parameters, gradients, optimizer states) is partitioned across GPUs.
  • Parameters are requested on demand and then freed immediately after use.
  • This allows training models far beyond the memory capacity of a single GPU.

How It Works (Baby Version)

  1. Load + Gather Parameters
    • Each GPU stores only a shard of the parameters.
    • Before computing the forward pass for a layer, use AllGather to collect the full parameters needed for that layer.
  2. Forward Computation
    • Perform forward pass locally with the gathered full parameters.
    • Once finished, free the parameters (keep only shard storage).
  3. Backward Computation
    • During backward pass, compute full gradients for the layer.
    • Use ReduceScatter to distribute gradient shards to the responsible GPUs.
    • Free gradients once they are scattered.
  4. Optimizer Update
    • Each GPU updates only its local parameter shard using its gradient shard + optimizer state shard.

Overlapping Communication and Computation

  • AllGather for layer $i+1$ can happen in parallel with the forward computation of layer $i$.
  • Similarly, ReduceScatter for layer $i$ can overlap with the backward computation of layer $i-1$.
  • This overlap masks communication cost, reducing overhead.

Communication Cost

  • For each iteration:
    • 2 × AllGather (#params): once in the forward pass, once in the backward pass.
    • 1 × ReduceScatter (#params): for the gradients.
  • Total communication ≈ same order as DDP, but memory footprint is dramatically smaller.

Memory Consumption

$$\frac{2\Psi + (2+K) \cdot \Psi}{N_d}$$
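
A toy sketch of the gather-compute-free pattern for one layer (illustrative only; `param_shard` and `layer_fn` are assumptions, and the real `torch.distributed.fsdp.FullyShardedDataParallel` manages this with flat parameters, hooks, and prefetching):

import torch
import torch.distributed as dist

def gather_layer_params(param_shard):
    """Materialize a layer's full (flat) parameters from the per-rank shards."""
    world = dist.get_world_size()
    full = torch.empty(param_shard.numel() * world,
                       device=param_shard.device, dtype=param_shard.dtype)
    dist.all_gather_into_tensor(full, param_shard)
    return full  # freed by the caller as soon as the layer's compute is done

def fsdp_layer_forward(param_shard, x, layer_fn):
    """Gather -> compute -> free, one layer at a time."""
    full_params = gather_layer_params(param_shard)
    out = layer_fn(full_params, x)
    del full_params          # only the shard stays resident
    return out

In practice one simply wraps the model (e.g. with `FullyShardedDataParallel` from `torch.distributed.fsdp`) rather than managing shards by hand.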

Limits of Data Parallelism #

  • Compute scaling limited by batch size.
  • Models still may not fit (activation memory not reduced).
    • Parameters (weights):
      • Fixed tensors of the model (e.g., $W$ in $y = Wx + b$).
      • They are the same across forward/backward and can be sharded with ZeRO.
    • Activations:
      • The intermediate outputs of each layer during the forward pass.
      • Example for a 2-layer MLP: $h_1 = W_1 x, \quad h_2 = \text{ReLU}(h_1), \quad y = W_2 h_2$. Here $h_1, h_2$ are activations.
    • To compute gradients during the backward pass, we need the activations. Example: for $y = W_2 h_2$, the gradient of the loss w.r.t. $W_2$ is $\nabla W_2 = \nabla y \cdot h_2^\top$ (see the check below).
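
A tiny sanity check of that formula with autograd (a toy sketch; the shapes and names here are made up):

import torch

# Toy 2-layer MLP forward pass: h1 = W1 x, h2 = ReLU(h1), y = W2 h2.
torch.manual_seed(0)
x = torch.randn(4, 1)
W1 = torch.randn(3, 4, requires_grad=True)
W2 = torch.randn(2, 3, requires_grad=True)

h1 = W1 @ x
h2 = torch.relu(h1)          # activation that must be kept for the backward pass
y = W2 @ h2
loss = y.sum()
loss.backward()

# Manual gradient: dL/dW2 = (dL/dy) h2^T, which requires the stored activation h2.
grad_y = torch.ones_like(y)  # dL/dy for loss = y.sum()
manual_grad_W2 = grad_y @ h2.T
print(torch.allclose(W2.grad, manual_grad_W2))  # True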

Beyond Data Parallel – Model Parallelism #

TBD