API Reference
Vectorized Memory Access
Basic Operations
KernelIntrinsics.vload — Function
vload(A::AbstractArray{T}, idx, ::Val{Nitem}, ::Val{Rebase}=Val(true), ::Val{Alignment}=Val(-1)) -> NTuple{Nitem,T}

Load Nitem elements from array A as a tuple, using vectorized memory operations on GPU. Nitem must be a positive power of 2.
Arguments
- A: Source array
- idx: Starting index
- Nitem: Number of elements to load (must be a positive power of 2)
- Rebase: Indexing mode (default: Val(true))
- Alignment: Known pointer alignment (default: Val(-1) = unknown)
  - Val(1): pointer is Nitem-aligned → single ld.global.vN, no runtime check
  - Val(2..Nitem): known misalignment offset → static load pattern, no runtime dispatch
  - Val(-1): unknown → runtime pointer alignment check (default)
  Only meaningful for Rebase=true; ignored for Rebase=false.
Indexing Modes
- Val(true) (rebased): Uses 1-based block indexing — idx selects the idx-th contiguous block of Nitem elements, i.e. loads from (idx-1)*Nitem + 1 to idx*Nitem. For example, idx=2 loads elements [5,6,7,8] for Nitem=4. When the array base pointer is Nitem-aligned, this generates optimal aligned vector loads (ld.global.v4); otherwise falls back to vload_multi.
- Val(false) (direct): Loads starting directly at idx, so idx=2 loads elements [2,3,4,5]. Always uses vload_multi to handle potential misalignment.
Example
a = CuArray{Int32}(1:16)
# Rebased indexing (default): idx=2 → loads block 2, i.e. elements 5,6,7,8
values = vload(a, 2, Val(4)) # returns (5, 6, 7, 8)
# Direct indexing: idx=2 → loads elements 2,3,4,5
values = vload(a, 2, Val(4), Val(false)) # returns (2, 3, 4, 5)
# Known alignment: view offset by 1 element → Alignment=2
v = view(a, 2:16)
values = vload(v, 1, Val(4), Val(true), Val(2)) # static (1,2,1) pattern, no runtime branch

See also: vstore!
KernelIntrinsics.vstore! — Function
vstore!(A::AbstractArray{T}, idx, values::NTuple{Nitem,T}, ::Val{Rebase}=Val(true), ::Val{Alignment}=Val(-1)) -> Nothing

Store Nitem elements from a tuple to array A, using vectorized memory operations on GPU. Nitem must be a positive power of 2.
Arguments
- A: Destination array
- idx: Starting index
- values: Tuple of Nitem elements to store
- Rebase: Indexing mode (default: Val(true))
- Alignment: Known pointer alignment (default: Val(-1) = unknown). Only meaningful for Rebase=true; ignored for Rebase=false.
Indexing Modes
- Val(true) (rebased): Uses 1-based block indexing — idx selects the idx-th contiguous block of Nitem elements, i.e. stores to (idx-1)*Nitem + 1 through idx*Nitem. For example, idx=2 stores to elements [5,6,7,8] for Nitem=4. When the array base pointer is Nitem-aligned, this generates optimal aligned vector stores (st.global.v4); otherwise falls back to vstore_multi!.
- Val(false) (direct): Stores starting directly at idx, so idx=2 stores to elements [2,3,4,5]. Always uses vstore_multi! to handle potential misalignment.
Example
b = CUDA.zeros(Int32, 16)
# Rebased indexing (default): idx=2 → stores to block 2, i.e. elements 5,6,7,8
vstore!(b, 2, (Int32(10), Int32(20), Int32(30), Int32(40)))
# Direct indexing: idx=2 → stores to elements 2,3,4,5
vstore!(b, 2, (Int32(10), Int32(20), Int32(30), Int32(40)), Val(false))

See also: vload
Memory Ordering
Macros
KernelIntrinsics.@fence — Macro
@fence [Scope] [Ordering]

Insert a memory fence with the specified scope and ordering.
A memory fence ensures that memory operations before the fence are visible to other threads before operations after the fence. This is essential for correct synchronization in parallel GPU code.
Arguments
- Scope (optional): Visibility scope, one of Device (default, maps to .gpu in PTX), Workgroup (maps to .cta), or System (maps to .sys).
- Ordering (optional): Memory ordering, one of Acquire, Release, AcqRel (default), or SeqCst. Weak, Volatile, and Relaxed are not valid for fences.
Arguments can be specified in any order.
Generated PTX
- @fence → fence.acq_rel.gpu
- @fence Workgroup → fence.acq_rel.cta
- @fence System SeqCst → fence.sc.sys
Example
@kernel function synchronized_kernel(X, Flag)
X[1] = 10
@fence # Ensure X[1]=10 is visible to other threads before continuing
Flag[1] = 1
end
# Explicit scope and ordering
@fence Device AcqRel
@fence Workgroup Release
@fence System SeqCst
@fence SeqCst Device # Order doesn't matter

See also: @access
KernelIntrinsics.@access — Macro
@access [Scope] [Ordering] expr

Perform a memory load or store with specified scope and ordering semantics.
This macro provides fine-grained control over memory ordering for lock-free synchronization patterns on GPU. It generates appropriate ld.acquire or st.release PTX instructions.
Arguments
- Scope (optional): Visibility scope, one of Device (default), Workgroup, or System. Cannot be specified with Volatile or Weak orderings, as those are scope-less.
- Ordering (optional): Memory ordering (see below).
- expr: A load or store expression (see Syntax Forms).
Arguments can be specified in any order.
Orderings
For loads (default: Acquire):
- Acquire: Subsequent reads see all writes before the corresponding release.
- Relaxed: No ordering guarantees.
- Volatile: Volatile load — bypasses cache, scope-less.
- Weak: Weak load — scope-less.
For stores (default: Release):
- Release: Prior writes are visible to other threads before this store.
- Relaxed: No ordering guarantees.
- Volatile: Volatile store — bypasses cache, scope-less.
- Weak: Weak store — scope-less.
AcqRel and SeqCst are not valid for individual loads/stores; use @fence instead.
Syntax Forms
@access array[idx] = value # Release store (default)
@access var = array[idx] # Acquire load, result bound to var (default)
@access array[idx] # Acquire load, result returned directly
@access Release array[idx] = value # Explicit ordering
@access Acquire var = array[idx] # Explicit ordering
@access Device Release array[idx] = value # Explicit scope and ordering
@access SeqCst Device array[idx] = value # Order doesn't matter

Example
@kernel function producer_consumer(X, Flag)
if @index(Global, Linear) == 1
X[1] = 42
@access Flag[1] = 1 # Release store: X[1]=42 visible before Flag[1]=1
end
# Other threads spin-wait using standalone load form
while (@access Acquire Flag[1]) != 1
end
# Now X[1] is guaranteed to be 42
end

See also: @fence
Scopes
KernelIntrinsics.Scope — Type
Scope

Abstract type representing the scope of a memory operation or fence.
Subtypes:
KernelIntrinsics.Workgroup — Type
Workgroup <: Scope

Thread block/workgroup scope. Synchronizes memory operations within a single thread block/workgroup.
KernelIntrinsics.Device — Type
Device <: Scope

Device/GPU scope. Synchronizes memory operations across all thread blocks on a single device.
KernelIntrinsics.System — Type
System <: Scope

System scope. Synchronizes memory operations across all devices, including the CPU.
Orderings
KernelIntrinsics.Ordering — Type
Ordering

Abstract type representing memory ordering semantics.
Subtypes:
KernelIntrinsics.Weak — Type
Weak <: Ordering

Weak memory ordering. Provides minimal ordering guarantees, allowing maximum hardware and compiler reordering flexibility.
KernelIntrinsics.Volatile — Type
Volatile <: Ordering

Volatile memory ordering. Prevents the compiler from caching or reordering the operation, ensuring each access goes directly to memory. Does not imply atomicity or inter-thread synchronization.
KernelIntrinsics.Relaxed — Type
Relaxed <: Ordering

Relaxed memory ordering. Ensures atomicity of individual operations but provides no synchronization guarantees. Operations may be reordered freely by hardware and compiler.
KernelIntrinsics.Acquire — Type
Acquire <: Ordering

Acquire memory ordering. Ensures that all memory operations after this point see all writes that happened before a corresponding Release operation.
KernelIntrinsics.Release — Type
Release <: Ordering

Release memory ordering. Ensures that all memory operations before this point are visible to threads that subsequently perform an Acquire operation.
KernelIntrinsics.AcqRel — Type
AcqRel <: Ordering

Acquire-Release memory ordering. Combines both Acquire and Release semantics. Used for read-modify-write operations and fences.
KernelIntrinsics.SeqCst — Type
SeqCst <: Ordering

Sequential consistency. Provides the strongest memory ordering guarantees, establishing a total order of all sequentially consistent operations across all threads.
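The same release/acquire pairing exists on the CPU via Julia's built-in per-field atomics, which can help build intuition for these orderings before using them in kernels. A minimal single-threaded sketch (the Mailbox type and field names are illustrative, not part of KernelIntrinsics):

```julia
# CPU analogue of release/acquire publication, using Base's `@atomic`.
mutable struct Mailbox
    data::Int
    @atomic flag::Int   # only the flag needs atomic ordering
end

box = Mailbox(0, 0)

# Producer: write the payload, then release-store the flag so the write
# is visible to any thread that later acquire-loads flag == 1.
box.data = 42
@atomic :release box.flag = 1

# Consumer: acquire-load the flag; once it observes 1, `data` is guaranteed to be 42.
seen = (@atomic :acquire box.flag) == 1 ? box.data : nothing
```

In a real kernel the same pattern is expressed with `@access Release Flag[1] = 1` on the writer side and `@access Acquire Flag[1]` on the reader side, as in the producer/consumer example above.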
Warp Operations
Macros
KernelIntrinsics.@warpsize — Macro
@warpsize()

Return the warp size of the current backend as an Int. Queries the backend at runtime — 32 on CUDA, 64 on ROCm.
See also: @laneid, @shfl, @warpreduce, @warpfold
KernelIntrinsics.@laneid — Macro
@laneid()

Return the 1-based lane index of the current thread within its warp/wavefront.
See also: @warpsize, @warpreduce, @warpfold
KernelIntrinsics.@shfl — Macro
@shfl(direction, val, src, [mask=0xffffffff])

Perform a warp shuffle operation, exchanging values between lanes within a warp.
Arguments
- direction: Shuffle direction (Up, Down, Xor, or Idx)
- val: Value to shuffle (supports primitives, structs, and NTuples)
- src: Offset (for Up/Down), XOR mask (for Xor), or source lane 0-based index (for Idx)
- mask: Lane participation mask (default: 0xffffffff for all lanes)
Example
@kernel function shfl_kernel(dst, src)
I = @index(Global, Linear)
val = src[I]
shuffled = @shfl(Up, val, 1) # Lane i receives from lane i-1; lane 0 keeps its value
shuffled = @shfl(Down, val, 1) # Lane i receives from lane i+1; last lane keeps its value
shuffled = @shfl(Xor, val, 1) # Swap adjacent pairs (lane 0↔1, 2↔3, ...)
shuffled = @shfl(Idx, val, 0) # Broadcast lane 0 to all lanes
dst[I] = shuffled
end

See also: @warpreduce, @warpfold
KernelIntrinsics.@warpreduce — Macro
@warpreduce(val, op, [lane=@laneid()], [warpsize=@warpsize()], [mask=0xffffffff])

Perform an inclusive prefix scan within a warp using shuffle-up operations.
After this macro, lane i (1-based) holds the result of applying op to the values of lanes 1 through i. The result in the last lane is the warp-wide reduction.
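The per-lane result can be modeled on the CPU as a Hillis-Steele inclusive scan, mirroring the log2(warpsize) shuffle-up steps (warpreduce_model is a hypothetical helper for illustration, not part of the API):

```julia
# CPU model of the per-lane result of @warpreduce: after log2(n) shuffle-up
# steps, "lane" i holds op applied over lanes 1..i (an inclusive scan).
function warpreduce_model(vals::Vector, op)
    n = length(vals)
    offset = 1
    while offset < n
        # Each lane i > offset combines with the value `offset` lanes below it.
        vals = [i > offset ? op(vals[i - offset], vals[i]) : vals[i] for i in 1:n]
        offset *= 2
    end
    return vals
end

warpreduce_model(collect(1:32), +)  # [1, 3, 6, 10, ..., 528]
```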
Arguments
- val: Value to scan (modified in-place)
- op: Binary associative operator
- lane: Current lane index (1-based; default: @laneid())
- warpsize: Warp size (default: @warpsize())
- mask: Lane participation mask (default: 0xffffffff)
Example
@kernel function scan_kernel(dst, src)
I = @index(Global, Linear)
val = src[I]
@warpreduce(val, +)
dst[I] = val
end
# Input: [1, 2, 3, 4, ..., 32]
# Output: [1, 3, 6, 10, ..., 528]

KernelIntrinsics.@warpfold — Macro
@warpfold(val, op, [lane=@laneid()], [warpsize=@warpsize()], [mask=0xffffffff])

Perform a warp-wide reduction, combining all lane values using the specified operator. Uses shuffle-down operations internally. After this macro, all lanes hold the warp-wide result.
Arguments
- val: Value to reduce (modified in-place)
- op: Binary associative operator
- lane: Current lane index (1-based; accepted for API consistency but unused; default: @laneid())
- warpsize: Warp size (default: @warpsize())
- mask: Lane participation mask (default: 0xffffffff)
Example
@kernel function reduce_kernel(dst, src)
I = @index(Global, Linear)
val = src[I]
lane = (I - 1) % @warpsize() + 1
@warpfold(val, +)
if lane == 1
dst[1] = val # Contains sum of all warp values
end
end
# Input: [1, 2, 3, ..., 32]
# Output: dst[1] = 528

See also: @warpreduce, @shfl
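The shuffle-down tree reduction can be sketched on the CPU as follows (warpfold_model is a hypothetical helper; in this simplified model only lane 1 is guaranteed the full result, which is what the lane == 1 check in the example above relies on):

```julia
# CPU model of a shuffle-down tree reduction: halve the stride each step;
# lanes whose partner falls past the end keep their own value.
function warpfold_model(vals::Vector, op)
    n = length(vals)
    offset = n ÷ 2
    while offset >= 1
        vals = [i + offset <= n ? op(vals[i], vals[i + offset]) : vals[i] for i in 1:n]
        offset ÷= 2
    end
    return vals
end

warpfold_model(collect(1:32), +)[1]  # 528
```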
KernelIntrinsics.@vote — Macro
@vote(mode, predicate, [mask=0xffffffff])

Perform a warp vote operation, evaluating a predicate across all participating lanes.
Arguments
- mode: Vote mode (All, AnyLane, Uni, or Ballot)
- predicate: Boolean predicate to evaluate
- mask: Lane participation mask (default: 0xffffffff for all lanes)
Example
@kernel function vote_kernel(dst, src, threshold)
I = @index(Global, Linear)
val = src[I]
all_above = @vote(All, val > threshold) # true if all lanes satisfy predicate
any_above = @vote(AnyLane, val > threshold) # true if any lane satisfies predicate
uniform = @vote(Uni, val > threshold) # true if all lanes have the same result
bits = @vote(Ballot, val > threshold) # UInt32 bitmask: bit i (0-based) set if lane i satisfies predicate
dst[I] = bits
end

See also: @shfl
Shuffle Directions
KernelIntrinsics.Up — Type
Up <: Direction

Shuffle direction where each lane receives a value from a lane with a lower index.
@shfl(Up, val, offset): Lane i receives the value from lane i - offset. Lanes where i < offset keep their original value.
Result for offset=1 (warpsize=32): [1, 1, 2, 3, 4, ..., 31]
KernelIntrinsics.Down — Type
Down <: Direction

Shuffle direction where each lane receives a value from a lane with a higher index.
@shfl(Down, val, offset): Lane i receives the value from lane i + offset. Lanes where i + offset >= warpsize keep their original value.
Result for offset=1 (warpsize=32): [2, 3, 4, ..., 31, 32, 32]
KernelIntrinsics.Xor — Type
Xor <: Direction

Shuffle direction where each lane exchanges values based on XOR of lane indices.
@shfl(Xor, val, mask): Lane i receives the value from lane i ⊻ mask.
Common patterns:
- mask=1: Swap adjacent pairs (0↔1, 2↔3, ...)
- mask=16: Swap first and second half of warp
KernelIntrinsics.Idx — Type
Idx <: Direction

Shuffle direction where all lanes receive a value from a specific lane index.
@shfl(Idx, val, lane): All lanes receive the value from lane lane (0-based).
Useful for broadcasting a value from one lane to all others.
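The four directions can be summarized with small CPU models (hypothetical helpers; lanes are 1-based vector indices, while the Xor mask and Idx source lane follow the 0-based convention used above):

```julia
# CPU models of the four shuffle directions over a vector of "lane" values.
shfl_up_model(v, off)   = [i > off ? v[i - off] : v[i] for i in eachindex(v)]               # low lanes keep their value
shfl_down_model(v, off) = [i + off <= length(v) ? v[i + off] : v[i] for i in eachindex(v)]  # high lanes keep their value
shfl_xor_model(v, m)    = [v[xor(i - 1, m) + 1] for i in eachindex(v)]                      # partner = lane ⊻ m (0-based)
shfl_idx_model(v, lane) = fill(v[lane + 1], length(v))                                      # broadcast 0-based `lane`

v = collect(1:8)
shfl_up_model(v, 1)    # [1, 1, 2, 3, 4, 5, 6, 7]
shfl_down_model(v, 1)  # [2, 3, 4, 5, 6, 7, 8, 8]
shfl_xor_model(v, 1)   # [2, 1, 4, 3, 6, 5, 8, 7]
shfl_idx_model(v, 0)   # [1, 1, 1, 1, 1, 1, 1, 1]
```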
Vote Modes
KernelIntrinsics.All — Type
All <: Mode

Vote mode that returns true if the predicate is true for all participating lanes.
KernelIntrinsics.AnyLane — Type
AnyLane <: Mode

Vote mode that returns true if the predicate is true for any participating lane.
KernelIntrinsics.Uni — Type
Uni <: Mode

Vote mode that returns true if the predicate has the same value across all participating lanes.
KernelIntrinsics.Ballot — Type
Ballot <: Mode

Vote mode that returns a UInt32 bitmask where bit i (0-based) is set if lane i's predicate is true.
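The Ballot bitmask can be modeled on the CPU as follows (ballot_model is a hypothetical helper for illustration):

```julia
# CPU model of the Ballot vote: bit (i - 1) of the result is set
# iff lane i's predicate is true (1-based lanes, 0-based bits).
function ballot_model(preds::Vector{Bool})
    bits = UInt32(0)
    for (i, p) in enumerate(preds)
        p && (bits |= UInt32(1) << (i - 1))
    end
    return bits
end

ballot_model([true, false, true, true])  # 0x0000000d (0b1101)
```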
Index
- KernelIntrinsics.AcqRel
- KernelIntrinsics.Acquire
- KernelIntrinsics.All
- KernelIntrinsics.AnyLane
- KernelIntrinsics.Ballot
- KernelIntrinsics.Device
- KernelIntrinsics.Direction
- KernelIntrinsics.Down
- KernelIntrinsics.Idx
- KernelIntrinsics.Mode
- KernelIntrinsics.Ordering
- KernelIntrinsics.Relaxed
- KernelIntrinsics.Release
- KernelIntrinsics.Scope
- KernelIntrinsics.SeqCst
- KernelIntrinsics.System
- KernelIntrinsics.Uni
- KernelIntrinsics.Up
- KernelIntrinsics.Volatile
- KernelIntrinsics.Weak
- KernelIntrinsics.Workgroup
- KernelIntrinsics.Xor
- KernelIntrinsics.vload
- KernelIntrinsics.vstore!
- KernelIntrinsics.@access
- KernelIntrinsics.@fence
- KernelIntrinsics.@laneid
- KernelIntrinsics.@shfl
- KernelIntrinsics.@vote
- KernelIntrinsics.@warpfold
- KernelIntrinsics.@warpreduce
- KernelIntrinsics.@warpsize