Introduction to GPU Architecture
The Mental Model
CPU: A few brilliant experts — 4–16 fast cores with complex logic, branch prediction, large caches. Optimized for sequential, complex tasks.
GPU: An army of simple workers — thousands of lightweight cores with minimal logic. Optimized for parallel, repetitive tasks.
GPUs trade individual thread performance for massive parallelism.
GPU Architecture: The Hierarchy
The GPU is organized in three levels:
- Warp: 32 threads executing in lockstep
- Block: Multiple warps that can share memory and synchronize
- SM (Streaming Multiprocessor): Physical hardware running blocks
The number of blocks we launch is an abstraction — the GPU scheduler assigns them to SMs.
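As a minimal sketch of how these levels show up in practice, assuming CUDA.jl (the kernel and array names below are illustrative): a launch of 8 blocks of 64 threads gives 512 threads in 16 warps, and each thread can compute where it sits in the hierarchy.

```julia
using CUDA

# 8 blocks of 64 threads = 512 threads = 16 warps of 32 (illustrative names).
function whoami!(warp_of, block_of)
    tid = threadIdx().x                          # 1..64 within the block
    bid = blockIdx().x                           # 1..8 across the grid
    gid = (bid - 1) * blockDim().x + tid         # 1..512 global index
    @inbounds warp_of[gid]  = (tid - 1) ÷ warpsize() + 1   # warp within the block
    @inbounds block_of[gid] = bid
    return
end

warp_of  = CUDA.zeros(Int32, 512)
block_of = CUDA.zeros(Int32, 512)
@cuda threads=64 blocks=8 whoami!(warp_of, block_of)
```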
Synchronization Within a Warp
Threads within a warp execute instructions in lockstep — they are implicitly synchronized, so no explicit barriers are needed. When thread 0 performs x += 1, the whole warp has completed that instruction before any thread moves on, so the other threads can rely on the updated value of x immediately afterwards. Threads can even read each other's registers through shuffle operations.
Example: Summing 32 elements using only registers.
Each thread holds one value. Shuffle instructions allow threads to read directly from one another’s registers — no memory access required. The sum is computed in five steps, achieving \(O(\log_2 n)\) complexity.
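A minimal sketch of this reduction, assuming CUDA.jl; the kernel name and the choice of shfl_down_sync are our illustration of the shuffle idea described above.

```julia
using CUDA

# One warp (32 threads) sums 32 values using only registers and shuffles.
function warp_sum!(out, x)
    lane = threadIdx().x                 # launch exactly 32 threads
    val  = x[lane]                       # each value sits in a register
    offset = 16
    while offset > 0                     # 16, 8, 4, 2, 1: five steps
        # read `val` from the thread `offset` lanes above; no memory traffic
        val += shfl_down_sync(0xffffffff, val, offset)
        offset >>= 1
    end
    lane == 1 && (out[1] = val)          # lane 1 holds the full sum
    return
end

x   = CUDA.rand(Float32, 32)
out = CUDA.zeros(Float32, 1)
@cuda threads=32 warp_sum!(out, x)
@assert Array(out)[1] ≈ sum(Array(x))
```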
Synchronization Between Warps (Within a Block)
Warps within a block execute independently — Warp 0 might complete while Warp 3 is still running.
To coordinate:
- Shared memory: fast on-chip memory visible to all threads in the block
- @synchronize: a barrier forcing all warps to wait
After @synchronize, shared memory is consistent — every thread sees what others wrote before the barrier.
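A minimal sketch of this pattern, assuming the @synchronize above comes from KernelAbstractions.jl (which provides shared memory via @localmem); the kernel and variable names are illustrative. Each block of 256 threads stages its slice of the input in shared memory, waits at the barrier, and then one thread combines the values.

```julia
using KernelAbstractions, CUDA

@kernel function block_sum!(partial, x)
    tid = @index(Local, Linear)        # 1..256 within the block
    gid = @index(Global, Linear)       # position in the whole array
    grp = @index(Group, Linear)        # which block we are

    tmp = @localmem Float32 (256,)     # shared memory, visible to the block
    @inbounds tmp[tid] = x[gid]        # every warp writes its part
    @synchronize                       # barrier: all writes are now visible

    if tid == 1                        # thread 1 reads what other warps wrote
        s = 0.0f0
        for j in 1:256
            @inbounds s += tmp[j]
        end
        @inbounds partial[grp] = s
    end
end

x = CUDA.rand(Float32, 1024)
partial = CUDA.zeros(Float32, 4)       # one partial sum per block
block_sum!(CUDABackend(), 256)(partial, x; ndrange = length(x))
KernelAbstractions.synchronize(CUDABackend())
@assert sum(Array(partial)) ≈ sum(Array(x))
```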
Synchronization Between Blocks
Blocks are independent — Block 1 on SM 1 knows nothing about Block 3 on SM 2. There is no @synchronize equivalent that works across blocks. The scheduler provides no guarantees about execution order: Block 47 might finish before Block 1 even starts, and not all blocks may be resident simultaneously.
To communicate between blocks:
- Multi-kernel launches: The safest pattern. Kernel boundaries provide implicit global synchronization — all blocks from kernel 1 are guaranteed complete before kernel 2 starts. Adds launch overhead (~5–10 μs) but eliminates deadlock risks entirely.
- Atomic operations: Hardware-guaranteed read-modify-write on global memory. Safe and correct, but slow (hundreds of cycles) and limited to primitive types (Float32, Float64, Int32, …). See the sketch after this list.
- Global memory with fences: Regular writes combined with a fence to ensure visibility. Faster than atomics for bulk data, but requires careful ordering — write data first, fence, then set a flag. The consumer must also fence after reading the flag before accessing the data.
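A minimal sketch of the atomics option, assuming CUDA.jl's CUDA.@atomic (the kernel name is illustrative): every thread, in whatever block it lives, folds its element into a single accumulator in global memory.

```julia
using CUDA

# Threads from many independent blocks safely update one global value.
function atomic_sum!(total, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        CUDA.@atomic total[1] += x[i]    # hardware read-modify-write, serialized
    end
    return
end

x     = CUDA.rand(Float32, 4096)
total = CUDA.zeros(Float32, 1)
@cuda threads=256 blocks=cld(4096, 256) atomic_sum!(total, x)
@assert Array(total)[1] ≈ sum(Array(x))
```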
The Memory Hierarchy
| Memory | Scope | Latency | Size | Bandwidth |
|---|---|---|---|---|
| Registers | Per thread | ~1 cycle | Limited | N/A |
| Shared | Per block | ~5 cycles | ~48 KB | ~10 TB/s |
| Global | All blocks | ~500 cycles | GBs | ~1 TB/s |
Summary: Synchronization & Memory
| Level | Synchronization | Memory | Cost |
|---|---|---|---|
| Within warp | Automatic (lockstep) | Registers | Free |
| Within block | @synchronize | Shared memory | Cheap |
| Between blocks | Atomics / kernel launch | Global memory | Expensive |
The Puzzle: Why So Many Blocks?
A typical laptop GPU has ~4 SMs, each running a handful of warps concurrently — perhaps 512 threads executing at any instant.
Yet we routinely launch thousands of blocks. Why?
The answer lies not in compute, but in memory latency.
Memory: The Producer-Consumer View
Think of memory as a slow producer (~500 cycles per request) and threads as consumers.
If threads just waited, the GPU would be 99% idle.
Latency Hiding
The problem: Memory requests take ~500 cycles. If threads waited idly, the GPU would be stalled.
The solution: Keep many warps ready to run.
While Warp 1 waits for data, Warps 2–7 compute. The scheduler switches between them instantly — zero-cost context switching because each warp maintains its own registers.
More warps in flight = more latency hiding = higher throughput.
This is why we launch far more threads than can physically execute at once.
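A minimal sketch of this principle with CUDA.jl (the saxpy! kernel is our illustration): query an occupancy-based launch configuration, then cover the whole array with blocks, which for a few million elements means thousands of blocks even though only a fraction can run at once.

```julia
using CUDA

function saxpy!(y, a, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return
end

n = 2^22                                     # ~4M elements
x = CUDA.rand(Float32, n)
y = CUDA.rand(Float32, n)

kernel  = @cuda launch=false saxpy!(y, 2.0f0, x)
config  = launch_configuration(kernel.fun)   # occupancy-based suggestion
threads = min(n, config.threads)
blocks  = cld(n, threads)                    # thousands of blocks for this n
kernel(y, 2.0f0, x; threads, blocks)         # oversubscribe to hide latency
```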
Coalesced Memory Access
Memory is transferred in chunks (typically 32 or 128 bytes). When threads in a warp access consecutive addresses, the hardware coalesces these into a single transaction.
In Julia (column-major order):
- Vector v[8]: v[1], v[2], v[3], v[4], v[5], v[6], v[7], v[8]
- Matrix M[2,4]: M[1,1], M[2,1], M[1,2], M[2,2], M[1,3], M[2,3], M[1,4], M[2,4]
When threads access consecutive addresses, hardware coalesces them into fewer transactions:
- Good: Thread i accesses index i → 1 transaction
- Bad: Thread i accesses index i×1000 → 32 transactions
Adjacent threads should access adjacent memory.
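A minimal sketch contrasting the two patterns, assuming CUDA.jl and Julia's column-major layout (kernel names are illustrative): with one thread per row the warp reads adjacent addresses at each step, with one thread per column it does not.

```julia
using CUDA

# Coalesced: at step j the warp reads M[1,j], M[2,j], ..., adjacent in memory.
function row_sums!(out, M)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(M, 1)
        s = 0.0f0
        for j in 1:size(M, 2)
            @inbounds s += M[i, j]
        end
        @inbounds out[i] = s
    end
    return
end

# Strided: at step j the warp reads M[j,1], M[j,2], ..., size(M,1) elements apart.
function col_sums!(out, M)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= size(M, 2)
        s = 0.0f0
        for j in 1:size(M, 1)
            @inbounds s += M[j, i]
        end
        @inbounds out[i] = s
    end
    return
end

M   = CUDA.rand(Float32, 1024, 1024)
out = CUDA.zeros(Float32, 1024)
@cuda threads=256 blocks=4 row_sums!(out, M)   # coalesced access
@cuda threads=256 blocks=4 col_sums!(out, M)   # poorly coalesced access
```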
In Practice: How to Get GPU Info
julia> dev = CUDA.device()
julia> CUDA.name(dev)
"NVIDIA RTX 1000 Ada Generation Laptop GPU"
# Global Memory:
julia> round(Int(CUDA.totalmem(dev)) / 1024^3, digits=3)
5.763 # GiB
# Shared Memory:
julia> CUDA.attribute(dev, CUDA.CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK)÷1024
48 # KB
julia> clock_rate = CUDA.attribute(dev, CUDA.CU_DEVICE_ATTRIBUTE_CLOCK_RATE)
1740000 # kHz
julia> mem_clock_rate = CUDA.attribute(dev, CUDA.CU_DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE)
8001000 # kHz
julia> mem_bus_width = CUDA.attribute(dev, CUDA.CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH) # bits
96 # bits
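# Peak bandwidth estimate: DDR memory transfers twice per clock (factor 2),
# and the bus width is converted from bits to bytes (÷ 8)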
julia> bandwidth = 2 * mem_clock_rate * 1e3 * mem_bus_width / 8 / 1024^3
178.83628606796265 # GiB/s
Summary
- GPUs trade single-thread speed for parallelism — one thread is slow, millions are fast
- Warp (32 threads): implicitly synchronized, communicate via registers
- Block (multiple warps): explicit synchronization with @synchronize, communicate via shared memory
- Grid (all blocks): no synchronization guarantees, communicate via global memory (expensive)
- Threads are an abstraction — launch many more than physically execute
- Latency hiding is why: keep warps waiting on memory while others compute
- Coalesced access is critical: adjacent threads must access adjacent memory