API Reference

Vectorized Memory Access

Basic Operations

KernelIntrinsics.vload — Function

vload(A::AbstractArray{T}, idx, ::Val{Nitem}, ::Val{Rebase}=Val(true), ::Val{Alignment}=Val(-1)) -> NTuple{Nitem,T}

Load Nitem elements from array A as a tuple, using vectorized memory operations on GPU. Nitem must be a positive power of 2.

Arguments

A: Source array
idx: Starting index
Nitem: Number of elements to load (must be a positive power of 2)
Rebase: Indexing mode (default: Val(true))
Alignment: Known pointer alignment (default: Val(-1) = unknown)
- Val(1): pointer is Nitem-aligned → single ld.global.vN, no runtime check
- Val(2..Nitem): known misalignment offset → static load pattern, no runtime dispatch
- Val(-1): unknown → runtime pointer alignment check (default)
Only meaningful for Rebase=true; ignored for Rebase=false.

Indexing Modes

Val(true) (rebased): Uses 1-based block indexing — idx selects the idx-th contiguous block of Nitem elements, i.e. loads from (idx-1)*Nitem + 1 to idx*Nitem. For example, idx=2 loads elements [5,6,7,8] for Nitem=4. When the array base pointer is Nitem-aligned, this generates optimal aligned vector loads (ld.global.v4); otherwise falls back to vload_multi.
Val(false) (direct): Loads starting directly at idx, so idx=2 loads elements [2,3,4,5]. Always uses vload_multi to handle potential misalignment.

Example

a = CuArray{Int32}(1:16)

# Rebased indexing (default): idx=2 → loads block 2, i.e. elements 5,6,7,8
values = vload(a, 2, Val(4))  # returns (5, 6, 7, 8)

# Direct indexing: idx=2 → loads elements 2,3,4,5
values = vload(a, 2, Val(4), Val(false))  # returns (2, 3, 4, 5)

# Known alignment: view offset by 1 element → Alignment=2
v = view(a, 2:16)
values = vload(v, 1, Val(4), Val(true), Val(2))  # static (1,2,1) pattern, no runtime branch

Memory Ordering

Macros

KernelIntrinsics.@fence — Macro

@fence [Scope] [Ordering]

Insert a memory fence with specified scope and ordering.

A memory fence ensures that memory operations before the fence are visible to other threads before operations after the fence. This is essential for correct synchronization in parallel GPU code.

Arguments

Scope (optional): Visibility scope, one of Device (default, maps to .gpu in PTX), Workgroup (maps to .cta), or System (maps to .sys).
Ordering (optional): Memory ordering, one of Acquire, Release, AcqRel (default), or SeqCst. Weak, Volatile, and Relaxed are not valid for fences.

Arguments can be specified in any order.

Generated PTX

@fence → fence.acq_rel.gpu
@fence Workgroup → fence.acq_rel.cta
@fence System SeqCst → fence.sc.sys

Example

@kernel function synchronized_kernel(X, Flag)
    X[1] = 10
    @fence  # Ensure X[1]=10 is visible to other threads before continuing
    Flag[1] = 1
end

# Explicit scope and ordering
@fence Device AcqRel
@fence Workgroup Release
@fence System SeqCst
@fence SeqCst Device  # Order doesn't matter

Scopes

KernelIntrinsics.Scope — Type

Scope

Abstract type representing the scope of a memory operation or fence.

Subtypes:

Workgroup: Thread block/workgroup scope
Device: Device/GPU scope
System: System scope (includes CPU and other devices)

source

KernelIntrinsics.Workgroup — Type

Workgroup <: Scope

Thread block/workgroup scope. Synchronizes memory operations within a single thread block/workgroup.

source

KernelIntrinsics.Device — Type

Device <: Scope

Device/GPU scope. Synchronizes memory operations across all thread blocks on a single device.

source

KernelIntrinsics.System — Type

System <: Scope

System scope. Synchronizes memory operations across all devices, including the CPU.

source

Orderings

KernelIntrinsics.Ordering — Type

Ordering

Abstract type representing memory ordering semantics.

Subtypes:

Weak: Minimal ordering guarantees
Volatile: Prevents compiler caching/reordering; no atomicity implied
Relaxed: Atomicity only, no synchronization
Acquire: Acquire semantics
Release: Release semantics
AcqRel: Acquire-Release semantics
SeqCst: Sequential consistency

source

KernelIntrinsics.Weak — Type

Weak <: Ordering

Weak memory ordering. Provides minimal ordering guarantees, allowing maximum hardware and compiler reordering flexibility.

source

KernelIntrinsics.Volatile — Type

Volatile <: Ordering

Volatile memory ordering. Prevents the compiler from caching or reordering the operation, ensuring each access goes directly to memory. Does not imply atomicity or inter-thread synchronization.

source

KernelIntrinsics.Relaxed — Type

Relaxed <: Ordering

Relaxed memory ordering. Ensures atomicity of individual operations but provides no synchronization guarantees. Operations may be reordered freely by hardware and compiler.

source

KernelIntrinsics.Acquire — Type

Acquire <: Ordering

Acquire memory ordering. Ensures that all memory operations after this point see all writes that happened before a corresponding Release operation.

source

KernelIntrinsics.Release — Type

Release <: Ordering

Release memory ordering. Ensures that all memory operations before this point are visible to threads that subsequently perform an Acquire operation.

source

KernelIntrinsics.AcqRel — Type

AcqRel <: Ordering

Acquire-Release memory ordering. Combines both Acquire and Release semantics. Used for read-modify-write operations and fences.

source

KernelIntrinsics.SeqCst — Type

SeqCst <: Ordering

Sequential consistency. Provides the strongest memory ordering guarantees, establishing a total order of all sequentially consistent operations across all threads.

source

Warp Operations

Macros

KernelIntrinsics.@warpsize — Macro

@warpsize()

Return the warp size of the current backend as an Int. Queries the backend at runtime — 32 on CUDA, 64 on ROCm.

source

KernelIntrinsics.@laneid — Macro

@laneid()

Return the 1-based lane index of the current thread within its warp/wavefront.

source

KernelIntrinsics.@shfl — Macro

@shfl(direction, val, src, [mask=0xffffffff])

Perform a warp shuffle operation, exchanging values between lanes within a warp.

Arguments

direction: Shuffle direction (Up, Down, Xor, or Idx)
val: Value to shuffle (supports primitives, structs, and NTuples)
src: Offset (for Up/Down), XOR mask (for Xor), or source lane 0-based index (for Idx)
mask: Lane participation mask (default: 0xffffffff for all lanes)

Example

@kernel function shfl_kernel(dst, src)
    I = @index(Global, Linear)
    val = src[I]

    shuffled = @shfl(Up, val, 1)    # Lane i receives from lane i-1; lane 0 keeps its value
    shuffled = @shfl(Down, val, 1)  # Lane i receives from lane i+1; last lane keeps its value
    shuffled = @shfl(Xor, val, 1)   # Swap adjacent pairs (lane 0↔1, 2↔3, ...)
    shuffled = @shfl(Idx, val, 0)   # Broadcast lane 0 to all lanes

    dst[I] = shuffled
end

Shuffle Directions

KernelIntrinsics.Direction — Type

Direction

Abstract type representing warp shuffle directions.

Subtypes:

Up: Shuffle values from lower lane indices
Down: Shuffle values from higher lane indices
Xor: Shuffle values using XOR of lane indices
Idx: Shuffle values from a specific lane index

source

KernelIntrinsics.Up — Type

Up <: Direction

Shuffle direction where each lane receives a value from a lane with a lower index.

@shfl(Up, val, offset): Lane i receives the value from lane i - offset. Lanes where i < offset keep their original value.

Result for offset=1 (warpsize=32): [1, 1, 2, 3, 4, ..., 31]

source

KernelIntrinsics.Down — Type

Down <: Direction

Shuffle direction where each lane receives a value from a lane with a higher index.

@shfl(Down, val, offset): Lane i receives the value from lane i + offset. Lanes where i + offset >= warpsize keep their original value.

Result for offset=1 (warpsize=32): [2, 3, 4, ..., 31, 32, 32]

source

KernelIntrinsics.Xor — Type

Xor <: Direction

Shuffle direction where each lane exchanges values based on XOR of lane indices.

@shfl(Xor, val, mask): Lane i receives the value from lane i ⊻ mask.

Common patterns:

mask=1: Swap adjacent pairs (0↔1, 2↔3, ...)
mask=16: Swap first and second half of warp

source

KernelIntrinsics.Idx — Type

Idx <: Direction

Shuffle direction where all lanes receive a value from a specific lane index.

@shfl(Idx, val, lane): All lanes receive the value from lane lane (0-based).

Useful for broadcasting a value from one lane to all others.

source

Vote Modes

KernelIntrinsics.Mode — Type

Mode

Abstract type representing warp vote modes.

Subtypes:

All: True if predicate is true for all lanes
AnyLane: True if predicate is true for any lane
Uni: True if predicate is uniform across all lanes
Ballot: Returns a bitmask of predicate values

source

KernelIntrinsics.All — Type

All <: Mode

Vote mode that returns true if the predicate is true for all participating lanes.

source

KernelIntrinsics.AnyLane — Type

AnyLane <: Mode

Vote mode that returns true if the predicate is true for any participating lane.

source

KernelIntrinsics.Uni — Type

Uni <: Mode

Vote mode that returns true if the predicate has the same value across all participating lanes.

source

KernelIntrinsics.Ballot — Type

Ballot <: Mode

Vote mode that returns a UInt32 bitmask where bit i (0-based) is set if lane i's predicate is true.

source

Index

KernelIntrinsics.AcqRel
KernelIntrinsics.Acquire
KernelIntrinsics.All
KernelIntrinsics.AnyLane
KernelIntrinsics.Ballot
KernelIntrinsics.Device
KernelIntrinsics.Direction
KernelIntrinsics.Down
KernelIntrinsics.Idx
KernelIntrinsics.Mode
KernelIntrinsics.Ordering
KernelIntrinsics.Relaxed
KernelIntrinsics.Release
KernelIntrinsics.Scope
KernelIntrinsics.SeqCst
KernelIntrinsics.System
KernelIntrinsics.Uni
KernelIntrinsics.Up
KernelIntrinsics.Volatile
KernelIntrinsics.Weak
KernelIntrinsics.Workgroup
KernelIntrinsics.Xor
KernelIntrinsics.vload
KernelIntrinsics.vstore!
KernelIntrinsics.@access
KernelIntrinsics.@fence
KernelIntrinsics.@laneid
KernelIntrinsics.@shfl
KernelIntrinsics.@vote
KernelIntrinsics.@warpfold
KernelIntrinsics.@warpreduce
KernelIntrinsics.@warpsize