API Reference

Vectorized Memory Access

Basic Operations

KernelIntrinsics.vload — Function
vload(A::AbstractArray{T}, idx, ::Val{Nitem}, ::Val{Rebase}=Val(true), ::Val{Alignment}=Val(-1)) -> NTuple{Nitem,T}

Load Nitem elements from array A as a tuple, using vectorized memory operations on GPU. Nitem must be a positive power of 2.

Arguments

  • A: Source array
  • idx: Starting index
  • Nitem: Number of elements to load (must be a positive power of 2)
  • Rebase: Indexing mode (default: Val(true))
  • Alignment: Known pointer alignment (default: Val(-1) = unknown)
    • Val(1): pointer is Nitem-aligned → single ld.global.vN, no runtime check
    • Val(2..Nitem): known misalignment offset → static load pattern, no runtime dispatch
    • Val(-1): unknown → runtime pointer alignment check (default)
    Only meaningful for Rebase=true; ignored for Rebase=false.
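
As an illustration of how a known misalignment can translate into a static load pattern, here is a hypothetical plain-Julia sketch (the greedy power-of-2 chunking and the name chunk_widths are assumptions for illustration, not the package's internals). An offset of 1 element with Nitem=4 yields the (1,2,1) pattern used in the example below:

```julia
# Hypothetical sketch: given a misalignment offset (in elements, relative to
# an Nitem-aligned base) compute chunk widths for a static load pattern,
# greedily taking the widest power-of-2 chunk the current position allows.
function chunk_widths(offset::Int, nitem::Int)
    widths = Int[]
    pos = offset
    remaining = nitem
    while remaining > 0
        align = pos == 0 ? nitem : 1 << trailing_zeros(pos)
        w = min(align, prevpow(2, remaining))  # widest chunk that stays aligned
        push!(widths, w)
        pos += w
        remaining -= w
    end
    return Tuple(widths)
end

chunk_widths(1, 4)  # (1, 2, 1) — one scalar, one v2, one scalar load
chunk_widths(0, 4)  # (4,)      — fully aligned: a single vector load
```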

Indexing Modes

  • Val(true) (rebased): Uses 1-based block indexing — idx selects the idx-th contiguous block of Nitem elements, i.e. loads from (idx-1)*Nitem + 1 to idx*Nitem. For example, idx=2 loads elements [5,6,7,8] for Nitem=4. When the array base pointer is Nitem-aligned, this generates optimal aligned vector loads (ld.global.v4); otherwise falls back to vload_multi.
  • Val(false) (direct): Loads starting directly at idx, so idx=2 loads elements [2,3,4,5]. Always uses vload_multi to handle potential misalignment.
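
A plain-Julia reference for the values each mode returns (illustration only; vload_ref is a hypothetical helper, and the real vload additionally emits vectorized loads on GPU):

```julia
# CPU reference semantics for the two indexing modes.
vload_ref(A, idx, ::Val{N}, ::Val{true}) where {N} =
    ntuple(k -> A[(idx - 1) * N + k], Val(N))   # rebased: block idx
vload_ref(A, idx, ::Val{N}, ::Val{false}) where {N} =
    ntuple(k -> A[idx + k - 1], Val(N))         # direct: element idx

a = Int32.(1:16)
vload_ref(a, 2, Val(4), Val(true))   # (5, 6, 7, 8)
vload_ref(a, 2, Val(4), Val(false))  # (2, 3, 4, 5)
```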

Example

a = CuArray{Int32}(1:16)

# Rebased indexing (default): idx=2 → loads block 2, i.e. elements 5,6,7,8
values = vload(a, 2, Val(4))  # returns (5, 6, 7, 8)

# Direct indexing: idx=2 → loads elements 2,3,4,5
values = vload(a, 2, Val(4), Val(false))  # returns (2, 3, 4, 5)

# Known alignment: view offset by 1 element → Alignment=2
v = view(a, 2:16)
values = vload(v, 1, Val(4), Val(true), Val(2))  # static (1,2,1) pattern, no runtime branch

See also: vstore!

source
KernelIntrinsics.vstore! — Function
vstore!(A::AbstractArray{T}, idx, values::NTuple{Nitem,T}, ::Val{Rebase}=Val(true), ::Val{Alignment}=Val(-1)) -> Nothing

Store Nitem elements from a tuple to array A, using vectorized memory operations on GPU. Nitem must be a positive power of 2.

Arguments

  • A: Destination array
  • idx: Starting index
  • values: Tuple of Nitem elements to store
  • Rebase: Indexing mode (default: Val(true))
  • Alignment: Known pointer alignment (default: Val(-1) = unknown); see vload for the accepted values. Only meaningful for Rebase=true; ignored for Rebase=false.

Indexing Modes

  • Val(true) (rebased): Uses 1-based block indexing — idx selects the idx-th contiguous block of Nitem elements, i.e. stores to (idx-1)*Nitem + 1 through idx*Nitem. For example, idx=2 stores to elements [5,6,7,8] for Nitem=4. When the array base pointer is Nitem-aligned, this generates optimal aligned vector stores (st.global.v4); otherwise falls back to vstore_multi!.
  • Val(false) (direct): Stores starting directly at idx, so idx=2 stores to elements [2,3,4,5]. Always uses vstore_multi! to handle potential misalignment.
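
The store-side indexing mirrors vload; a plain-Julia reference (illustration only; vstore_ref! is a hypothetical helper, and the real vstore! emits vectorized stores on GPU):

```julia
# CPU reference semantics for vstore!'s indexing modes.
function vstore_ref!(A, idx, vals::NTuple{N}, ::Val{Rebase}) where {N,Rebase}
    base = Rebase ? (idx - 1) * N : idx - 1  # rebased: block idx; direct: element idx
    for k in 1:N
        A[base + k] = vals[k]
    end
    return nothing
end

b = zeros(Int32, 16)
vstore_ref!(b, 2, Int32.((10, 20, 30, 40)), Val(true))   # writes b[5:8]
```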

Example

b = CUDA.zeros(Int32, 16)

# Rebased indexing (default): idx=2 → stores to block 2, i.e. elements 5,6,7,8
vstore!(b, 2, (Int32(10), Int32(20), Int32(30), Int32(40)))

# Direct indexing: idx=2 → stores to elements 2,3,4,5
vstore!(b, 2, (Int32(10), Int32(20), Int32(30), Int32(40)), Val(false))

See also: vload

source

Memory Ordering

Macros

KernelIntrinsics.@fence — Macro
@fence [Scope] [Ordering]

Insert a memory fence with specified scope and ordering.

A memory fence ensures that memory operations before the fence are visible to other threads before operations after the fence. This is essential for correct synchronization in parallel GPU code.

Arguments

  • Scope (optional): Visibility scope, one of Device (default, maps to .gpu in PTX), Workgroup (maps to .cta), or System (maps to .sys).
  • Ordering (optional): Memory ordering, one of Acquire, Release, AcqRel (default), or SeqCst. Weak, Volatile, and Relaxed are not valid for fences.

Arguments can be specified in any order.

Generated PTX

  • @fence → fence.acq_rel.gpu
  • @fence Workgroup → fence.acq_rel.cta
  • @fence System SeqCst → fence.sc.sys

Example

@kernel function synchronized_kernel(X, Flag)
    X[1] = 10
    @fence  # Ensure X[1]=10 is visible to other threads before continuing
    Flag[1] = 1
end

# Explicit scope and ordering
@fence Device AcqRel
@fence Workgroup Release
@fence System SeqCst
@fence SeqCst Device  # Order doesn't matter

See also: @access

source
KernelIntrinsics.@access — Macro
@access [Scope] [Ordering] expr

Perform a memory load or store with specified scope and ordering semantics.

This macro provides fine-grained control over memory ordering for lock-free synchronization patterns on GPU. It generates appropriate ld.acquire or st.release PTX instructions.

Arguments

  • Scope (optional): Visibility scope, one of Device (default), Workgroup, or System. Cannot be specified with Volatile or Weak orderings, as those are scope-less.
  • Ordering (optional): Memory ordering (see below).
  • expr: A load or store expression (see Syntax Forms).

Arguments can be specified in any order.

Orderings

For loads (default: Acquire):

  • Acquire: Subsequent reads see all writes before the corresponding release.
  • Relaxed: No ordering guarantees.
  • Volatile: Volatile load — bypasses cache, scope-less.
  • Weak: Weak load — scope-less.

For stores (default: Release):

  • Release: Prior writes are visible to other threads before this store.
  • Relaxed: No ordering guarantees.
  • Volatile: Volatile store — bypasses cache, scope-less.
  • Weak: Weak store — scope-less.

AcqRel and SeqCst are not valid for individual loads/stores; use @fence instead.

Syntax Forms

@access array[idx] = value                  # Release store (default)
@access var = array[idx]                    # Acquire load, result bound to var (default)
@access array[idx]                          # Acquire load, result returned directly

@access Release array[idx] = value          # Explicit ordering
@access Acquire var = array[idx]            # Explicit ordering
@access Device Release array[idx] = value   # Explicit scope and ordering
@access Release Device array[idx] = value   # Order doesn't matter

Example

@kernel function producer_consumer(X, Flag)
    if @index(Global, Linear) == 1
        X[1] = 42
        @access Flag[1] = 1  # Release store: X[1]=42 visible before Flag[1]=1
    end

    # Other threads spin-wait using standalone load form
    while (@access Acquire Flag[1]) != 1
    end
    # Now X[1] is guaranteed to be 42
end

See also: @fence

source

Scopes

KernelIntrinsics.Device — Type
Device <: Scope

Device/GPU scope. Synchronizes memory operations across all thread blocks on a single device.

source

Orderings

KernelIntrinsics.Weak — Type
Weak <: Ordering

Weak memory ordering. Provides minimal ordering guarantees, allowing maximum hardware and compiler reordering flexibility.

source
KernelIntrinsics.Volatile — Type
Volatile <: Ordering

Volatile memory ordering. Prevents the compiler from caching or reordering the operation, ensuring each access goes directly to memory. Does not imply atomicity or inter-thread synchronization.

source
KernelIntrinsics.Relaxed — Type
Relaxed <: Ordering

Relaxed memory ordering. Ensures atomicity of individual operations but provides no synchronization guarantees. Operations may be reordered freely by hardware and compiler.

source
KernelIntrinsics.Acquire — Type
Acquire <: Ordering

Acquire memory ordering. Ensures that all memory operations after this point see all writes that happened before a corresponding Release operation.

source
KernelIntrinsics.Release — Type
Release <: Ordering

Release memory ordering. Ensures that all memory operations before this point are visible to threads that subsequently perform an Acquire operation.

source
KernelIntrinsics.SeqCst — Type
SeqCst <: Ordering

Sequential consistency. Provides the strongest memory ordering guarantees, establishing a total order of all sequentially consistent operations across all threads.

source

Warp Operations

Macros

KernelIntrinsics.@shfl — Macro
@shfl(direction, val, src, [mask=0xffffffff])

Perform a warp shuffle operation, exchanging values between lanes within a warp.

Arguments

  • direction: Shuffle direction (Up, Down, Xor, or Idx)
  • val: Value to shuffle (supports primitives, structs, and NTuples)
  • src: Offset (for Up/Down), XOR mask (for Xor), or 0-based source lane index (for Idx)
  • mask: Lane participation mask (default: 0xffffffff for all lanes)

Example

@kernel function shfl_kernel(dst, src)
    I = @index(Global, Linear)
    val = src[I]

    shuffled = @shfl(Up, val, 1)    # Lane i receives from lane i-1; lane 0 keeps its value
    shuffled = @shfl(Down, val, 1)  # Lane i receives from lane i+1; last lane keeps its value
    shuffled = @shfl(Xor, val, 1)   # Swap adjacent pairs (lane 0↔1, 2↔3, ...)
    shuffled = @shfl(Idx, val, 0)   # Broadcast lane 0 to all lanes

    dst[I] = shuffled
end

See also: @warpreduce, @warpfold

source
KernelIntrinsics.@warpreduce — Macro
@warpreduce(val, op, [lane=@laneid()], [warpsize=@warpsize()], [mask=0xffffffff])

Perform an inclusive prefix scan within a warp using shuffle-up operations.

After this macro, lane i (1-based) holds the result of applying op to the values of lanes 1 through i. The result in the last lane is the warp-wide reduction.
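
The scan can be modeled on the CPU as a Hillis-Steele pass over a vector of lane values (an illustration of the semantics; warp_scan is a hypothetical helper, not the macro's implementation):

```julia
# Scalar model of a shuffle-up inclusive scan: each round, lane i combines
# in the value from lane i - offset, and the offset doubles.
function warp_scan(vals::Vector{T}, op) where {T}
    n = length(vals)            # stands in for the warp size
    out = copy(vals)
    offset = 1
    while offset < n
        prev = copy(out)
        for i in 1:n
            if i - offset >= 1  # lanes below the offset keep their value
                out[i] = op(prev[i - offset], prev[i])
            end
        end
        offset <<= 1
    end
    return out
end

warp_scan(collect(1:32), +)  # [1, 3, 6, 10, ..., 528]
```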

Arguments

  • val: Value to scan (modified in-place)
  • op: Binary associative operator
  • lane: Current lane index (1-based; default: @laneid())
  • warpsize: Warp size (default: @warpsize())
  • mask: Lane participation mask (default: 0xffffffff)

Example

@kernel function scan_kernel(dst, src)
    I = @index(Global, Linear)
    val = src[I]

    @warpreduce(val, +)

    dst[I] = val
end

# Input:  [1, 2, 3, 4, ..., 32]
# Output: [1, 3, 6, 10, ..., 528]

See also: @warpfold, @shfl

source
KernelIntrinsics.@warpfold — Macro
@warpfold(val, op, [lane=@laneid()], [warpsize=@warpsize()], [mask=0xffffffff])

Perform a warp-wide reduction, combining all lane values using the specified operator. Uses shuffle-down operations internally. After this macro, all lanes hold the warp-wide result.
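
The reduction can be modeled as a shuffle-down tree followed by a broadcast (an illustration of the semantics only; warp_fold is a hypothetical helper, the macro's exact internals may differ, and the sketch assumes a power-of-two lane count):

```julia
# Scalar model of a shuffle-down tree reduction: each round, lane i combines
# in the value from lane i + offset, halving the offset; lane 1 ends with
# the full result, which is then broadcast to every lane.
function warp_fold(vals::Vector{T}, op) where {T}
    n = length(vals)
    out = copy(vals)
    offset = n >> 1
    while offset >= 1
        prev = copy(out)
        for i in 1:n
            if i + offset <= n  # out-of-range lanes keep their value
                out[i] = op(prev[i], prev[i + offset])
            end
        end
        offset >>= 1
    end
    fill!(out, out[1])          # broadcast lane 1's total to all lanes
    return out
end

warp_fold(collect(1:32), +)  # every lane holds 528
```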

Arguments

  • val: Value to reduce (modified in-place)
  • op: Binary associative operator
  • lane: Current lane index (1-based; accepted for API consistency but unused; default: @laneid())
  • warpsize: Warp size (default: @warpsize())
  • mask: Lane participation mask (default: 0xffffffff)

Example

@kernel function reduce_kernel(dst, src)
    I = @index(Global, Linear)
    val = src[I]
    lane = (I - 1) % @warpsize() + 1

    @warpfold(val, +)

    if lane == 1
        dst[1] = val  # Contains sum of all warp values
    end
end

# Input:  [1, 2, 3, ..., 32]
# Output: dst[1] = 528

See also: @warpreduce, @shfl

source
KernelIntrinsics.@vote — Macro
@vote(mode, predicate, [mask=0xffffffff])

Perform a warp vote operation, evaluating a predicate across all participating lanes.

Arguments

  • mode: Vote mode (All, AnyLane, Uni, or Ballot)
  • predicate: Boolean predicate to evaluate
  • mask: Lane participation mask (default: 0xffffffff for all lanes)
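
The four modes have simple scalar models over a vector of per-lane predicates (illustration only; the vote_* names are hypothetical, not part of the API):

```julia
# Scalar models of the four vote modes.
vote_all(p)    = all(p)
vote_any(p)    = any(p)
vote_uni(p)    = all(p) || !any(p)  # predicate uniform across lanes
vote_ballot(p) = sum(UInt32(p[i]) << (i - 1) for i in eachindex(p))  # bit i-1 ↔ lane i

p = [true, false, true, true]
vote_ballot(p)  # 0x0000000d (bits 0, 2, 3 set)
```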

Example

@kernel function vote_kernel(dst, src, threshold)
    I = @index(Global, Linear)
    val = src[I]

    all_above = @vote(All,     val > threshold)  # true if all lanes satisfy predicate
    any_above = @vote(AnyLane, val > threshold)  # true if any lane satisfies predicate
    uniform   = @vote(Uni,     val > threshold)  # true if all lanes have the same result
    bits      = @vote(Ballot,  val > threshold)  # UInt32 bitmask: bit i (0-based) set if lane i satisfies predicate

    dst[I] = bits
end

See also: @shfl

source

Shuffle Directions

KernelIntrinsics.Direction — Type
Direction

Abstract type representing warp shuffle directions.

Subtypes:

  • Up: Shuffle values from lower lane indices
  • Down: Shuffle values from higher lane indices
  • Xor: Shuffle values using XOR of lane indices
  • Idx: Shuffle values from a specific lane index
source
KernelIntrinsics.Up — Type
Up <: Direction

Shuffle direction where each lane receives a value from a lane with a lower index.

@shfl(Up, val, offset): Lane i receives the value from lane i - offset. Lanes where i < offset keep their original value.

Result for offset=1 (warpsize=32): [1, 1, 2, 3, 4, ..., 31]

source
KernelIntrinsics.Down — Type
Down <: Direction

Shuffle direction where each lane receives a value from a lane with a higher index.

@shfl(Down, val, offset): Lane i receives the value from lane i + offset. Lanes where i + offset >= warpsize keep their original value.

Result for offset=1 (warpsize=32): [2, 3, 4, ..., 31, 32, 32]

source
KernelIntrinsics.Xor — Type
Xor <: Direction

Shuffle direction where each lane exchanges values based on XOR of lane indices.

@shfl(Xor, val, mask): Lane i receives the value from lane i ⊻ mask.

Common patterns:

  • mask=1: Swap adjacent pairs (0↔1, 2↔3, ...)
  • mask=16: Swap first and second half of warp
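
The partner lane for any mask is just the XOR of the lane index (a small illustration; xor_partners is a hypothetical helper):

```julia
# 0-based partner lanes for an XOR shuffle mask.
xor_partners(mask, n = 32) = [i ⊻ mask for i in 0:n-1]

xor_partners(1, 8)   # [1, 0, 3, 2, 5, 4, 7, 6]: adjacent pairs swap
xor_partners(16)[1]  # 16: lane 0 pairs with lane 16
```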
source
KernelIntrinsics.Idx — Type
Idx <: Direction

Shuffle direction where all lanes receive a value from a specific lane index.

@shfl(Idx, val, lane): All lanes receive the value from lane lane (0-based).

Useful for broadcasting a value from one lane to all others.

source

Vote Modes

KernelIntrinsics.Mode — Type
Mode

Abstract type representing warp vote modes.

Subtypes:

  • All: True if predicate is true for all lanes
  • AnyLane: True if predicate is true for any lane
  • Uni: True if predicate is uniform across all lanes
  • Ballot: Returns a bitmask of predicate values
source
KernelIntrinsics.Uni — Type
Uni <: Mode

Vote mode that returns true if the predicate has the same value across all participating lanes.

source
KernelIntrinsics.Ballot — Type
Ballot <: Mode

Vote mode that returns a UInt32 bitmask where bit i (0-based) is set if lane i's predicate is true.

source