API Reference

Copy

Luma.vcopy! - Function
vcopy!(dst::AbstractGPUVector, src::AbstractGPUVector; Nitem=4)

Copy src to dst using vectorized GPU memory access.

Performs a high-throughput copy by loading and storing Nitem elements per thread, reducing memory transaction overhead compared to scalar copies.

Arguments

  • dst: Destination GPU vector
  • src: Source GPU vector (must have same length as dst)
  • Nitem=4: Number of elements processed per thread. Higher values improve throughput but require length(src) to be divisible by Nitem.

Example

src = CUDA.rand(Float32, 1024)
dst = CUDA.zeros(Float32, 1024)
vcopy!(dst, src)
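
Nitem can be raised when the vector length is divisible by it; a usage sketch:

# 1024 is divisible by 8, so a wider per-thread copy is valid here
vcopy!(dst, src; Nitem=8)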

See also: Luma.setvalue!

Luma.setvalue! - Function
setvalue!(dst::AbstractGPUVector{T}, val::T; Nitem=4) where T

Fill dst with val using vectorized GPU memory access.

Performs a high-throughput fill by storing Nitem copies of val per thread, reducing memory transaction overhead compared to scalar writes.

Arguments

  • dst: Destination GPU vector
  • val: Value to fill (must match element type of dst)
  • Nitem=4: Number of elements written per thread. Higher values improve throughput but require length(dst) to be divisible by Nitem.

Example

dst = CUDA.zeros(Float32, 1024)
setvalue!(dst, 1.0f0)

See also: Luma.vcopy!

Map-Reduce

Luma.mapreduce - Function
mapreduce(f, op, src::AbstractGPUArray; dims=nothing, kwargs...) -> Array or scalar

GPU parallel map-reduce operation with optional dimension reduction.

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • src: Input GPU array

Keyword Arguments

  • dims=nothing: Dimensions to reduce over. Options:
    • nothing or : (a Colon): Reduce over all dimensions (returns a scalar or a 1-element array)
    • Int: Reduce over single dimension
    • Tuple{Int...}: Reduce over multiple dimensions (must be contiguous from start or end)
  • g=identity: Post-reduction transformation
  • init=nothing: Initial value (currently unused, for API compatibility)
  • to_cpu=false: If true and dims=nothing, return scalar; otherwise return GPU array
  • Additional kwargs passed to underlying implementations

Dimension Constraints

The dims argument must specify contiguous dimensions from either:

  • The beginning: (1,), (1,2), (1,2,3), etc.
  • The end: (n-1,n), (n,), etc. for an n-dimensional array

Examples

A = CUDA.rand(Float32, 100, 50, 20)

# Full reduction (all dimensions)
total = mapreduce(identity, +, A; to_cpu=true)

# Reduce along dim 1: (100, 50, 20) -> (50, 20)
col_sums = mapreduce(identity, +, A; dims=1)

# Reduce along dims (1,2): (100, 50, 20) -> (20,)
plane_sums = mapreduce(identity, +, A; dims=(1,2))

# Reduce along last dim: (100, 50, 20) -> (100, 50)
depth_sums = mapreduce(identity, +, A; dims=3)

# Reduce along last two dims: (100, 50, 20) -> (100,)
slice_sums = mapreduce(identity, +, A; dims=(2,3))
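
The g keyword applies a post-reduction transformation to the result; a sketch of computing a mean this way (following the g usage shown for mapreduce2d below):

# Mean over all elements
mean_val = mapreduce(identity, +, A; g = x -> x / length(A), to_cpu=true)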

See also: Luma.mapreduce!, mapreduce1d, mapreduce2d

mapreduce(f, op, srcs::NTuple{N,AbstractGPUArray}; dims=nothing, kwargs...)

Multi-array mapreduce. Only supports full reduction (dims=nothing).
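
A usage sketch, mirroring the dot-product example given for mapreduce1d:

x, y = CUDA.rand(Float32, 10_000), CUDA.rand(Float32, 10_000)
dot_xy = mapreduce((a, b) -> a * b, +, (x, y); to_cpu=true)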

Luma.mapreduce! - Function
mapreduce!(f, op, dst, src; dims=nothing, kwargs...)

In-place GPU parallel map-reduce with dimension support.

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • dst: Output array
  • src: Input GPU array

Keyword Arguments

  • dims=nothing: Dimensions to reduce over (see mapreduce for details)
  • g=identity: Post-reduction transformation
  • Additional kwargs passed to underlying implementations

Examples

A = CUDA.rand(Float32, 100, 50)
col_sums = CUDA.zeros(Float32, 50)
row_sums = CUDA.zeros(Float32, 100)

# Column sums (reduce dim 1)
mapreduce!(identity, +, col_sums, A; dims=1)

# Row sums (reduce dim 2)
mapreduce!(identity, +, row_sums, A; dims=2)
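
The g keyword can post-process each reduced value in place as well; a sketch:

# Column means: divide each column sum by the number of rows
mapreduce!(identity, +, col_sums, A; dims=1, g = x -> x / size(A, 1))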

See also: Luma.mapreduce

Luma.mapreduce2d - Function
mapreduce2d(f, op, src, dim; kwargs...) -> Vector

GPU parallel reduction along dimension dim.

  • dim=1: Column-wise reduction (vertical), output length = number of columns
  • dim=2: Row-wise reduction (horizontal), output length = number of rows

Arguments

  • f: Element-wise transformation
  • op: Reduction operator
  • src: Input matrix of size (n, p)
  • dim: Dimension to reduce along (1 or 2)

Keyword Arguments

  • g=identity: Post-reduction transformation
  • tmp=nothing: Pre-allocated temporary buffer
  • FlagType=UInt8: Synchronization flag type

For dim=1 (column-wise):

  • Nitem=nothing: Items per thread
  • Nthreads=nothing: Threads per column reduction
  • workgroup=nothing: Workgroup size
  • blocks=nothing: Number of blocks

For dim=2 (row-wise):

  • chunksz=nothing: Chunk size for row processing
  • Nblocks=nothing: Number of blocks per row
  • workgroup=nothing: Workgroup size
  • blocks_row=nothing: Blocks per row

Examples

A = CUDA.rand(Float32, 1000, 500)

# Column sums (reduce along dim=1)
col_sums = mapreduce2d(identity, +, A, 1)

# Row maximums (reduce along dim=2)
row_maxs = mapreduce2d(identity, max, A, 2)

# Column means
col_means = mapreduce2d(identity, +, A, 1; g=x -> x / size(A, 1))

# Sum of squares per row
row_ss = mapreduce2d(abs2, +, A, 2)

See also: Luma.mapreduce2d! for the in-place version.

Luma.mapreduce2d! - Function
mapreduce2d!(f, op, dst, src, dim; kwargs...)

In-place GPU parallel reduction along dimension dim.

  • dim=1: Column-wise reduction (vertical), dst length = number of columns
  • dim=2: Row-wise reduction (horizontal), dst length = number of rows

Arguments

  • f: Element-wise transformation
  • op: Reduction operator
  • dst: Output vector
  • src: Input matrix of size (n, p)
  • dim: Dimension to reduce along (1 or 2)

Keyword Arguments

  • g=identity: Post-reduction transformation
  • tmp=nothing: Pre-allocated temporary buffer
  • FlagType=UInt8: Synchronization flag type

For dim=1 (column-wise):

  • Nitem=nothing: Items per thread
  • Nthreads=nothing: Threads per column reduction
  • workgroup=nothing: Workgroup size
  • blocks=nothing: Number of blocks

For dim=2 (row-wise):

  • chunksz=nothing: Chunk size for row processing
  • Nblocks=nothing: Number of blocks per row
  • workgroup=nothing: Workgroup size
  • blocks_row=nothing: Blocks per row

Examples

A = CUDA.rand(Float32, 1000, 500)
col_sums = CUDA.zeros(Float32, 500)
row_maxs = CUDA.zeros(Float32, 1000)

# Column sums
mapreduce2d!(identity, +, col_sums, A, 1)

# Row maximums
mapreduce2d!(identity, max, row_maxs, A, 2)

See also: Luma.mapreduce2d for the allocating version.

Luma.mapreduce1d - Function
mapreduce1d(f, op, src; kwargs...) -> GPU array or scalar
mapreduce1d(f, op, srcs::NTuple; kwargs...) -> GPU array or scalar

GPU parallel map-reduce operation.

Applies f to each element, reduces with op, and optionally applies g to the final result.

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • src or srcs: Input GPU array(s)

Keyword Arguments

  • g=identity: Post-reduction transformation applied to final result
  • tmp=nothing: Pre-allocated temporary buffer
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • blocks=100: Number of blocks
  • FlagType=UInt8: Synchronization flag type
  • to_cpu=false: If true, return scalar; otherwise return 1-element GPU array

Examples

# Sum of squares (returns GPU array)
x = CUDA.rand(Float32, 10_000)
result = mapreduce1d(x -> x^2, +, x)

# Sum of squares (returns scalar)
result = mapreduce1d(x -> x^2, +, x; to_cpu=true)

# Dot product of two arrays
x, y = CUDA.rand(Float32, 10_000), CUDA.rand(Float32, 10_000)
result = mapreduce1d((a, b) -> a * b, +, (x, y); to_cpu=true)

See also: Luma.mapreduce1d! for the in-place version.

Luma.mapreduce1d! - Function
mapreduce1d!(f, op, dst, src; kwargs...)
mapreduce1d!(f, op, dst, srcs::NTuple; kwargs...)

In-place GPU parallel map-reduce, writing result to dst[1].

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • dst: Output array (result written to first element)
  • src or srcs: Input GPU array(s)

Keyword Arguments

  • g=identity: Post-reduction transformation applied to final result
  • tmp=nothing: Pre-allocated temporary buffer
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • blocks=100: Number of blocks
  • FlagType=UInt8: Synchronization flag type

Examples

x = CUDA.rand(Float32, 10_000)
dst = CUDA.zeros(Float32, 1)

# Sum
mapreduce1d!(identity, +, dst, x)

# With pre-allocated temporary for repeated calls
tmp = Luma.get_allocation(mapreduce1d!, x)
for i in 1:100
    mapreduce1d!(identity, +, dst, x; tmp)
end

See also: Luma.mapreduce1d for the allocating version.

Scan

Luma.scan - Function
scan(f, op, src; kwargs...) -> GPU array
scan(op, src; kwargs...) -> GPU array

GPU parallel prefix scan (cumulative reduction) using a decoupled lookback algorithm.

Applies f to each element, then computes inclusive prefix scan with op.

Arguments

  • f: Map function applied to each element (defaults to identity)
  • op: Associative binary scan operator
  • src: Input GPU array

Keyword Arguments

  • tmp=nothing: Pre-allocated temporary buffer
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • FlagType=UInt8: Synchronization flag type

Examples

# Cumulative sum
x = CUDA.rand(Float32, 10_000)
result = scan(+, x)

# Cumulative sum of squares
result = scan(x -> x^2, +, x)

# With pre-allocated temporary for repeated calls
tmp = Luma.get_allocation(scan!, similar(x), x)
result = scan(+, x; tmp)

See also: Luma.scan! for the in-place version.

Luma.scan! - Function
scan!(f, op, dst, src; kwargs...)
scan!(op, dst, src; kwargs...)

In-place GPU parallel prefix scan using a decoupled lookback algorithm.

Applies f to each element, then computes inclusive prefix scan with op, writing results to dst.

Arguments

  • f: Map function applied to each element (defaults to identity)
  • op: Associative binary scan operator
  • dst: Output array for scan results
  • src: Input GPU array

Keyword Arguments

  • tmp=nothing: Pre-allocated temporary buffer
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • FlagType=UInt8: Synchronization flag type

Examples

x = CUDA.rand(Float32, 10_000)
dst = similar(x)

# Cumulative sum
scan!(+, dst, x)

# With pre-allocated temporary for repeated calls
tmp = Luma.get_allocation(scan!, dst, x)
for i in 1:100
    scan!(+, dst, x; tmp)
end

See also: Luma.scan for the allocating version.

Matrix-Vector

Luma.matvec - Function
matvec([f, op,] src::AbstractMatrix, x; kwargs...) -> dst
matvec!([f, op,] dst, src, x; kwargs...)

Generalized matrix-vector operation with customizable element-wise and reduction operations.

Computes dst[i] = g(op_j(f(src[i,j], x[j]))) for each row i, where op_j denotes reduction over columns. For standard matrix-vector multiplication, this is dst[i] = sum_j(src[i,j] * x[j]).

The allocating version matvec returns a newly allocated result vector. The in-place version matvec! writes to dst.

Arguments

  • f: Binary operation applied element-wise (default: *)
  • op: Reduction operation across columns (default: +)
  • dst: Output vector (in-place versions only)
  • src: Input matrix
  • x: Input vector, or nothing for row-wise reduction of src alone

Keyword Arguments

  • g=identity: Unary transformation applied to each reduced row
  • tmp=nothing: Pre-allocated temporary buffer for inter-block communication
  • chunksz=nothing: Elements per thread (auto-tuned if nothing)
  • Nblocks=nothing: Number of thread blocks (auto-tuned if nothing)
  • workgroup=nothing: Threads per block (auto-tuned if nothing)
  • blocks_row=nothing: Number of blocks used to process a single row; relevant only for wide matrices (many columns, few rows) where parallelizing across columns is beneficial. Auto-tuned if nothing.
  • FlagType=UInt8: Integer type for synchronization flags

Examples

A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 500)

# Standard matrix-vector multiply: y = A * x
y = matvec(A, x)

# Row-wise sum: y[i] = sum(A[i, :])
y = matvec(A, nothing)

# Row-wise maximum: y[i] = max_j(A[i, j])
y = matvec(identity, max, A, nothing)

# Softmax numerator: y[i] = sum_j(exp(A[i,j] - x[j]))
y = matvec((a, b) -> exp(a - b), +, A, x)

# In-place version
dst = CUDA.zeros(Float32, 1000)
matvec!(dst, A, x)

Extended Help

For tall matrices (many rows, few columns), each row is processed by a single block. For wide matrices (few rows, many columns), multiple blocks collaborate on each row; the total number of blocks Nblocks is computed from blocks_row, and for a matrix consisting of a single large row, blocks_row equals Nblocks.

Pre-allocating tmp avoids repeated allocation when calling matvec! in a loop. With FlagType=UInt8 (the default), the flag buffer must be zeroed before each call. Using FlagType=UInt64 skips this zeroing by generating a random target flag on each call; the failure probability is n/2^64, which is negligible for practical n. The output element type is inferred as promote_op(g, promote_op(f, eltype(src), eltype(x))).
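
For example, a repeated in-place call that skips the per-call flag zeroing (a minimal sketch of the trade-off described above, reusing dst, A, and x from the Examples):

for i in 1:100
    # UInt64 flags avoid re-zeroing the flag buffer before each call
    matvec!(dst, A, x; FlagType=UInt64)
end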

Luma.matvec! - Function
matvec!([f, op,] dst, src, x; kwargs...)

In-place version of Luma.matvec: writes the result to dst instead of allocating it. The two functions share a docstring; see Luma.matvec above for the full description of arguments, keyword arguments, examples, and extended help.
Luma.vecmat! - Function
vecmat!(dst, x, A; kwargs...)
vecmat!(f, op, dst, x, A; kwargs...)

GPU parallel vector-matrix multiplication: dst = g(op(f(x .* A), dims=1)).

For the standard vector-matrix product, vecmat!(dst, x, A) computes dst[j] = sum_i(x[i] * A[i,j]). When x = nothing, it computes column reductions: dst[j] = sum_i(A[i,j]).

Arguments

  • f=identity: Element-wise transformation applied to x[i] * A[i,j] (or A[i,j] if x=nothing)
  • op=+: Reduction operator
  • dst: Output vector of length p (number of columns)
  • x: Input vector of length n (number of rows), or nothing for pure column reduction
  • A: Input matrix of size (n, p)

Keyword Arguments

  • g=identity: Optional post-reduction transformation
  • tmp=nothing: Pre-allocated temporary buffer (from get_allocation)
  • Nitem=nothing: Number of items per thread (auto-selected if nothing)
  • Nthreads=nothing: Number of threads per column reduction
  • workgroup=nothing: Workgroup size
  • blocks=nothing: Maximum number of blocks
  • FlagType=UInt8: Type for synchronization flags
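
Examples

A usage sketch based on the operations described above (allocations assumed to be CUDA arrays, as in the other examples):

A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 1000)
dst = CUDA.zeros(Float32, 500)

# Vector-matrix product: dst[j] = sum_i x[i] * A[i, j]
vecmat!(dst, x, A)

# Column sums: dst[j] = sum_i A[i, j]
vecmat!(dst, nothing, A)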

Utilities

Luma.get_allocation - Function
get_allocation(::typeof(mapreduce1d!), src; blocks=100, eltype=nothing, FlagType=UInt8)

Allocate temporary buffer for mapreduce1d!. Useful for repeated reductions.

Arguments

  • src or srcs: Input GPU array(s) (used for backend and default element type)

Keyword Arguments

  • blocks=100: Number of blocks (must match the blocks used in mapreduce1d!)
  • eltype=nothing: Element type for intermediate values. If nothing, defaults to the element type of src. For proper type inference, pass promote_op(f, T, ...).
  • FlagType=UInt8: Synchronization flag type

Examples

x = CUDA.rand(Float32, 10_000)
tmp = Luma.get_allocation(mapreduce1d!, x)
dst = CUDA.zeros(Float32, 1)

for i in 1:100
    mapreduce1d!(identity, +, dst, x; tmp)
end
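
When f changes the element type, eltype should reflect the promoted type; a sketch using Base.promote_op as suggested above:

# f promotes Float32 inputs to Float64, so intermediates must be Float64
f64 = v -> Float64(v)^2
tmp64 = Luma.get_allocation(mapreduce1d!, x; eltype=Base.promote_op(f64, Float32))
dst64 = CUDA.zeros(Float64, 1)
mapreduce1d!(f64, +, dst64, x; tmp=tmp64)
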
get_allocation(::typeof(scan!), dst, src; kwargs...)

Allocate temporary buffer for scan!. Useful for repeated scans.

Arguments

  • dst: Output GPU array (used for element type of intermediates)
  • src: Input GPU array (used for backend)

Keyword Arguments

  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size (must match the workgroup used in scan!)
  • FlagType=UInt8: Synchronization flag type

Examples

x = CUDA.rand(Float32, 10_000)
dst = similar(x)
tmp = Luma.get_allocation(scan!, dst, x)

for i in 1:100
    scan!(+, dst, x; tmp)
end