CPU, GPU, CUDA #

CPU/Central Processing Unit #

The term CPU is a bit ambiguous. See below for details on the hierarchy.

1. Hierarchy of Parallelism in Computing

1.1. Single Laptop/Desktop #

  • Generally has one processor (one socket, i.e., one physical chip). That processor has multiple cores, each running 1–2 hardware threads.
  • Example: MacBook Pro M2 → 1 socket, 10 CPU cores, 10 threads.

1.2. Clusters (HPC / Cloud) #

  • Cluster = multiple machines (nodes) networked together. Managed by a scheduler (Slurm, PBS, Kubernetes).
  • Node = a single computer in a cluster, consisting of one or more sockets.
  • Socket = the physical package (processor) which contains multiple cores sharing the same memory.
  • Core = the smallest unit of computing; it has one or more hardware threads and is responsible for executing instructions.
  • Thread
    • Software thread: a sequence of instructions scheduled by the OS; all software threads of a process share the same memory address space.
    • Hardware thread: the hardware context in which one software thread runs. (A short sketch querying the hardware-thread count follows this list.)
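
A minimal sketch (not from the original notes) of how a program can ask the OS for this count; `std::thread::hardware_concurrency()` reports the number of hardware threads visible to the OS (roughly sockets × cores × threads per core):

```cpp
// cores.cpp -- print the number of hardware threads the OS exposes on this node.
// On the MacBook Pro M2 example above this would report 10.
#include <cstdio>
#include <thread>

int main() {
    unsigned hw = std::thread::hardware_concurrency();  // may return 0 if the count is unknown
    std::printf("hardware threads visible to the OS: %u\n", hw);
    return 0;
}
```
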
2. Memory Hierarchy
  • Registers: per-core, nanosecond speed.
  • Cache (L1, L2, L3): closer to CPU, faster than RAM.
    • A relatively small amount of very fast memory (compared to RAM) located on the processor chip (die). It holds data fetched from main memory close to the cores working on it (see the cache-locality sketch after this list).
    • A modern processor has three cache levels: L1 and L2 are local to each core, while L3 (or Last Level Cache (LLC)) is shared among all cores of a CPU.
  • RAM (DRAM): node-level working memory.
    • RAM is used as working memory for the cores. RAM is volatile memory, losing all content when the power is switched off. In general, RAM is shared between all sockets on a node (all sockets can access all RAM).
  • Disk/SSD: permanent storage.
  • Parallel File System (Lustre, GPFS): cluster-level storage.
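
To make the cache levels concrete, here is a hedged sketch (file name and sizes are illustrative) that sums the same matrix with a cache-friendly and a cache-unfriendly access pattern; on most machines the strided traversal is several times slower because it misses L1/L2/L3 almost constantly:

```cpp
// cache_demo.cpp -- row-major (stride-1) vs. column-major (stride-n) traversal of a matrix.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4096;                                     // 4096 x 4096 floats ~ 64 MB, larger than a typical L3
    std::vector<float> a(static_cast<size_t>(n) * n, 1.0f);

    auto time_sum = [&](bool row_major) {
        const auto t0 = std::chrono::steady_clock::now();
        double sum = 0.0;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                sum += row_major ? a[static_cast<size_t>(i) * n + j]   // stride-1: streams through cache lines
                                 : a[static_cast<size_t>(j) * n + i];  // stride-n: a cache miss nearly every access
        const auto t1 = std::chrono::steady_clock::now();
        std::printf("%-12s sum=%.0f  %7.1f ms\n", row_major ? "row-major" : "column-major",
                    sum, std::chrono::duration<double, std::milli>(t1 - t0).count());
    };

    time_sum(true);   // cache-friendly
    time_sum(false);  // cache-unfriendly
    return 0;
}
```
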
3. Parallel Programming Models

There are two main ways to parallelize work on a CPU (a hybrid sketch follows the list below):

  • OpenMP (Open Multi-Processing): within a node (shared memory, threads).
    • OpenMP assigns threads to CPU cores; all threads share the same memory space.
  • MPI (Message Passing Interface): across nodes (distributed memory).
    • Memory is segmented by processes; each process has private memory, and data must be explicitly passed between processes.
  • Hybrid: MPI between nodes + OpenMP threads inside each node.
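
A hedged sketch of the hybrid model (assumes an MPI installation and OpenMP support; compiled with something like `mpicxx -fopenmp hybrid.cpp`): one MPI process runs per node (distributed memory), and each process spawns OpenMP threads that share that node's memory.

```cpp
// hybrid.cpp -- MPI between processes/nodes + OpenMP threads inside each process.
#include <cstdio>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                    // distributed memory: one rank per process
    int rank = 0, nranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel                       // shared memory: threads within this rank
    {
        std::printf("rank %d/%d, thread %d/%d\n",
                    rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();                            // data between ranks would move via MPI_Send/MPI_Recv
    return 0;
}
```
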

GPU #

1. GPU vs CPU #

Multicore CPUs have on the order of 10–100 cores; GPUs have on the order of several thousand cores.

1. CPU
  • Scale: O(10) ALUs with a broad instruction set (versatile, latency-optimized).
  • Core level: Control + ALUs + L1 cache.
    • ALU: arithmetic/logic execution unit.
    • Control: fetch, decode, schedule instructions.
  • Chip (socket) level:
    • Sophisticated control units.
    • Per-core local L1 cache.
    • Per-core L2 cache.
    • Many cores + shared last-level cache (e.g., L3) + memory controllers.
  • System/node level: Socket + DRAM modules plugged into the motherboard.
2. GPU
  • Scale: O(10^3–10^4) ALUs (CUDA cores) with a limited instruction set.
    • Optimized for throughput (massive parallelism), not single-thread performance.
  • Chip level:
    • Basic control units.
    • L1 cache / shared memory local to each SM.
    • A larger L2 cache shared by all SMs.
  • Off-chip DRAM (global memory).
2.1 GPU Model

Software abstraction (CUDA programming model); a minimal indexing sketch follows this list.

  1. Kernel
    • A GPU function launched by the CPU (host).
    • Runs in parallel on many threads of the GPU.
  2. Grid
    • The collection of all blocks launched for a kernel.
    • Represents the full set of threads running the kernel across all SMs of the GPU device.
  3. Block
    • A group of threads (e.g., 128, 256, 512 threads).
    • All threads in a block are scheduled on a single Streaming Multiprocessor (SM). (One SM can host multiple blocks concurrently if resources allow).
    • Threads in the same block can cooperate via shared memory and synchronize with __syncthreads().
  4. Warp
    • Fixed group of 32 threads executed together in lockstep (SIMT).
    • If threads diverge (e.g., different if/else paths), execution is serialized (warp divergence).
  5. Thread
    • Smallest execution unit.
    • Each thread executes one instance of the kernel function.
    • Threads are executed in warps of 32. Each thread in a warp runs on one Streaming Processor (SP), also called a CUDA core, within an SM.
    • An SP is essentially an ALU inside the GPU.
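
A minimal sketch of this hierarchy in CUDA C++ (kernel and file names are illustrative): each thread derives a global index from its block and thread coordinates, and the host picks the grid and block sizes at launch.

```cpp
// indexing.cu -- grid of blocks of threads; one line of output per block.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void whoami(int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index within the grid
    if (i < n && threadIdx.x == 0)                   // print once per block to keep output readable
        printf("block %d of %d, %d threads/block, warp size %d\n",
               blockIdx.x, gridDim.x, blockDim.x, warpSize);
}

int main() {
    const int n = 1 << 20;                                          // one logical thread per element
    const int threadsPerBlock = 256;                                // one block = 256 threads = 8 warps
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; // grid = 4096 blocks
    whoami<<<blocks, threadsPerBlock>>>(n);                         // kernel launch: host -> device
    cudaDeviceSynchronize();                                        // wait so the device printf output appears
    return 0;
}
```
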

Hardware level (GPU architecture). Each GPU chip has a fixed number of Streaming Multiprocessors (SMs); the number of SMs × CUDA cores per SM gives the total number of CUDA cores. (A device-query sketch follows the list below.)

Example: 1 A100 = 108 SMs × 64 CUDA cores per SM = 6912 total CUDA cores.

  1. Streaming Multiprocessor (SM)
    • A physical compute unit on the GPU chip.
    • Contains:
      • Many Streaming Processors (SPs) (e.g., 64–128 CUDA cores per SM, depending on architecture).
      • Specialized units (load/store, SFUs for transcendental functions, tensor cores, registers, shared memory, warp schedulers).
    • An SM schedules and executes warps (groups of 32 threads) using its internal SPs.
    • Multiple blocks can run on the same SM concurrently, as long as resources (registers, shared memory) allow.
  2. Streaming Processor (SP / CUDA Core)
    • Executes one thread’s instructions (like an ALU in a CPU).
    • Each SM has many SPs working in parallel.
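
As a quick check of these numbers, a hedged sketch using the CUDA runtime's `cudaGetDeviceProperties` to query device 0 (on an A100 this reports 108 SMs and a warp size of 32):

```cpp
// devinfo.cu -- query SM count, warp size, and per-block limits of device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("device            : %s\n", prop.name);
    std::printf("SMs               : %d\n", prop.multiProcessorCount);
    std::printf("warp size         : %d threads\n", prop.warpSize);
    std::printf("max threads/block : %d\n", prop.maxThreadsPerBlock);
    std::printf("shared mem/block  : %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```
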

Visual Hierarchy (conceptual)

Grid
 └── Blocks
      └── Warps (32 threads each)
           └── Threads
                └── Mapped to SPs (CUDA cores)
SMs (hardware level)
 └── Host multiple warps/blocks
 └── Contain SPs, registers, shared memory, schedulers

Kernel launch
 → Grid (software: all blocks)
 → Blocks (software: each mapped to one SM)
 → SM (hardware: runs many blocks concurrently)
 → Warps (32 threads per warp, scheduled by the SM)
 → Threads (software)
 → SPs (hardware execution units = CUDA cores)


2.2 Host–Device Interaction (Heterogeneous Computing)
  • Program execution split between CPU (host) and GPU (device).
  • Workflow (a minimal sketch follows this section):
    1. CPU prepares data in host memory.
    2. Data is copied to GPU memory over the PCIe bus.
    3. GPU executes massively parallel kernel functions.
    4. Results are copied back to the CPU.
  • PCIe bus bottleneck:
    • High overhead when transferring small chunks of data.
    • GPU performance gain is visible only for large problem sizes.
2.3 Performance Considerations
  • Occupancy: near-optimal GPU performance requires keeping a large fraction of the GPU's ALUs busy (enough resident warps per SM).
  • Bandwidth: The maximum rate at which data can be transferred between memory and the processor.
    • Units: bytes/second (e.g., roughly 1.5 TB/s of HBM2 bandwidth on an NVIDIA A100, or 900 GB/s on a V100).
    • Often limited by hardware interfaces (e.g., GPU DRAM speed, PCIe bus).
    • If an algorithm requires frequent memory loads/stores relative to computation, its speed will be memory-bound (performance capped by bandwidth).
  • Arithmetic Intensity (AI): $$ \text{Arithmetic Intensity} = \frac{\text{Number of Floating Point Operations (FLOPs)}}{\text{Bytes of Data Moved}} $$
    • Units: FLOPs/byte.
    • High AI → compute-intensive (performance limited by GPU compute capability).
    • Low AI → memory-intensive (performance limited by bandwidth).
  • Example: the SAXPY (Single-precision A·X Plus Y) kernel $$ y_i \leftarrow a \cdot x_i + y_i $$ For each element:
    • FLOPs = 2 (1 multiply, 1 add).
    • Memory: read x_i, read y_i, and write y_i → 3 floats (12 bytes). $$ \text{Arithmetic Intensity} = \frac{2 \, \text{FLOPs}}{12 \, \text{bytes}} \approx 0.17 \, \text{FLOPs/byte} $$ This is very low, so SAXPY is dominated by memory bandwidth, not GPU compute.
  • Performance ≈ min(Compute capacity, Bandwidth × AI); a worked example follows this list.
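
Plugging the SAXPY numbers into this formula, and using approximate NVIDIA A100 figures as assumptions (about 19.5 TFLOP/s peak FP32 compute and roughly 1.5 TB/s of HBM2 bandwidth): $$ \text{Performance} \approx \min(19{,}500 \ \text{GFLOP/s},\ 1{,}555 \ \text{GB/s} \times 0.17 \ \text{FLOPs/byte}) \approx 260 \ \text{GFLOP/s} $$ so SAXPY can reach only about 1% of the GPU's peak compute; it is firmly memory-bound.
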
2.4 CUDA Programming Model
  • Definition: CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and API.
    • Extends C/C++/Fortran with keywords (e.g., __global__, __device__) to write GPU kernels; a minimal example follows this list.
    • Provides a programming abstraction over the GPU hardware (threads, blocks, grids, memory hierarchy).
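
A hedged, self-contained sketch of the SAXPY kernel from 2.3 written with these keywords (the `__device__` helper exists only to illustrate the qualifier; unified memory via `cudaMallocManaged` is used instead of explicit copies for brevity):

```cpp
// saxpy.cu -- __global__ kernel, __device__ helper, and <<<blocks, threads>>> launch.
#include <cstdio>
#include <cuda_runtime.h>

__device__ float axpy_one(float a, float x, float y) {   // callable only from device code
    return a * x + y;
}

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = axpy_one(a, x[i], y[i]);            // 2 FLOPs per 12 bytes moved
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));             // accessible from host and device
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 3.0f, x, y);            // y <- 3*x + y
    cudaDeviceSynchronize();

    std::printf("y[0] = %.1f (expected 5.0)\n", y[0]);
    cudaFree(x); cudaFree(y);
    return 0;
}
```
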
3. What about TPUs?
  • TPU = Tensor Processing Unit (Google’s custom application-specific integrated circuit/ASIC for ML).
  • High-level design: lightweight control logic + systolic array for fast matrix multiplies + high-bandwidth on-chip memory.
  • GPU model: many SMs (general-purpose multiprocessors with CUDA cores + tensor cores).
  • TPU model: fewer but much larger matrix units (systolic arrays) instead of many SMs; designed to maximize matmul throughput.
  • Tradeoffs:
    • Extremely efficient for dense matmul/convolution (training & inference).
    • Less flexible than GPUs for irregular or non-matmul workloads.

Other Notes

  • Modern hardware is designed for fast matrix multiplication: matmuls can run more than 10× faster (in FLOP/s) than other floating-point operations.
  • FLOP/s grow faster than memory bandwidth, so it is hard to keep the compute units fed with data.

Reference #

  1. https://ams148-spring18-01.courses.soe.ucsc.edu/lecture-notes.html
  2. https://hpc-wiki.info/hpc/HPC-Dictionary
  3. https://hpc-wiki.info/mediawiki/hpc_images/4/48/GPU_tutorial_saxpy_introduction.pdf
  4. https://jonathan-hui.medium.com/ai-chips-a100-gpu-with-nvidia-ampere-architecture-3034ed685e6e