API Reference
Luma.vcopy! — Function
vcopy!(dst::AbstractGPUVector, src::AbstractGPUVector; Nitem=4)
Copy src to dst using vectorized GPU memory access.
Performs a high-throughput copy by loading and storing Nitem elements per thread, reducing memory transaction overhead compared to scalar copies.
Arguments
- dst: Destination GPU vector
- src: Source GPU vector (must have same length as dst)
- Nitem=4: Number of elements processed per thread. Higher values improve throughput but require length(src) to be divisible by Nitem.
Example
src = CUDA.rand(Float32, 1024)
dst = CUDA.zeros(Float32, 1024)
vcopy!(dst, src)
See also: Luma.setvalue!
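Since length(src) here (1024) is divisible by 8, a wider per-thread width is also valid (a small sketch using the documented Nitem keyword):
vcopy!(dst, src; Nitem=8)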
Luma.setvalue! — Function
setvalue!(dst::AbstractGPUVector{T}, val::T; Nitem=4) where T
Fill dst with val using vectorized GPU memory access.
Performs a high-throughput fill by storing Nitem copies of val per thread, reducing memory transaction overhead compared to scalar writes.
Arguments
- dst: Destination GPU vector
- val: Value to fill (must match element type of dst)
- Nitem=4: Number of elements written per thread. Higher values improve throughput but require length(dst) to be divisible by Nitem.
Example
dst = CUDA.zeros(Float32, 1024)
setvalue!(dst, 1.0f0)
See also: Luma.vcopy!
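As with vcopy!, a larger Nitem is valid whenever length(dst) is divisible by it (sketch using the documented keyword):
setvalue!(dst, 1.0f0; Nitem=8)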
Map-Reduce
Luma.mapreduce — Function
mapreduce(f, op, src::AbstractGPUArray; dims=nothing, kwargs...) -> Array or scalar
GPU parallel map-reduce operation with optional dimension reduction.
Arguments
- f: Map function applied to each element
- op: Associative binary reduction operator
- src: Input GPU array
Keyword Arguments
- dims=nothing: Dimensions to reduce over. Options:
  - nothing or : (Colon): Reduce over all dimensions (returns scalar or 1-element array)
  - Int: Reduce over a single dimension
  - Tuple{Int...}: Reduce over multiple dimensions (must be contiguous from start or end)
- g=identity: Post-reduction transformation
- init=nothing: Initial value (currently unused, for API compatibility)
- to_cpu=false: If true and dims=nothing, return a scalar; otherwise return a GPU array
- Additional kwargs are passed to the underlying implementations
Dimension Constraints
The dims argument must specify contiguous dimensions from either:
- The beginning: (1,), (1,2), (1,2,3), etc.
- The end: (n-1,n), (n,), etc. for an n-dimensional array
Examples
A = CUDA.rand(Float32, 100, 50, 20)
# Full reduction (all dimensions)
total = mapreduce(identity, +, A; to_cpu=true)
# Reduce along dim 1: (100, 50, 20) -> (50, 20)
col_sums = mapreduce(identity, +, A; dims=1)
# Reduce along dims (1,2): (100, 50, 20) -> (20,)
plane_sums = mapreduce(identity, +, A; dims=(1,2))
# Reduce along last dim: (100, 50, 20) -> (100, 50)
depth_sums = mapreduce(identity, +, A; dims=3)
# Reduce along last two dims: (100, 50, 20) -> (100,)
slice_sums = mapreduce(identity, +, A; dims=(2,3))
See also: Luma.mapreduce!, mapreduce1d, mapreduce2d
mapreduce(f, op, srcs::NTuple{N,AbstractGPUArray}; dims=nothing, kwargs...)
Multi-array mapreduce. Only supports full reduction (dims=nothing).
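A minimal sketch of this multi-array method, following the same calling convention as mapreduce1d below (f receives one element from each array):
x, y = CUDA.rand(Float32, 1000), CUDA.rand(Float32, 1000)
dot_xy = mapreduce((a, b) -> a * b, +, (x, y); to_cpu=true)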
Luma.mapreduce! — Function
mapreduce!(f, op, dst, src; dims=nothing, kwargs...)
In-place GPU parallel map-reduce with dimension support.
Arguments
- f: Map function applied to each element
- op: Associative binary reduction operator
- dst: Output array
- src: Input GPU array
Keyword Arguments
- dims=nothing: Dimensions to reduce over (see mapreduce for details)
- g=identity: Post-reduction transformation
- Additional kwargs are passed to the underlying implementations
Examples
A = CUDA.rand(Float32, 100, 50)
col_sums = CUDA.zeros(Float32, 50)
row_sums = CUDA.zeros(Float32, 100)
# Column sums (reduce dim 1)
mapreduce!(identity, +, col_sums, A; dims=1)
# Row sums (reduce dim 2)
mapreduce!(identity, +, row_sums, A; dims=2)
See also: Luma.mapreduce
Luma.mapreduce2d — Function
mapreduce2d(f, op, src, dim; kwargs...) -> Vector
GPU parallel reduction along dimension dim.
- dim=1: Column-wise reduction (vertical), output length = number of columns
- dim=2: Row-wise reduction (horizontal), output length = number of rows
Arguments
- f: Element-wise transformation
- op: Reduction operator
- src: Input matrix of size (n, p)
- dim: Dimension to reduce along (1 or 2)
Keyword Arguments
- g=identity: Post-reduction transformation
- tmp=nothing: Pre-allocated temporary buffer
- FlagType=UInt8: Synchronization flag type
For dim=1 (column-wise):
- Nitem=nothing: Items per thread
- Nthreads=nothing: Threads per column reduction
- workgroup=nothing: Workgroup size
- blocks=nothing: Number of blocks
For dim=2 (row-wise):
- chunksz=nothing: Chunk size for row processing
- Nblocks=nothing: Number of blocks per row
- workgroup=nothing: Workgroup size
- blocks_row=nothing: Blocks per row
Examples
A = CUDA.rand(Float32, 1000, 500)
# Column sums (reduce along dim=1)
col_sums = mapreduce2d(identity, +, A, 1)
# Row maximums (reduce along dim=2)
row_maxs = mapreduce2d(identity, max, A, 2)
# Column means
col_means = mapreduce2d(identity, +, A, 1; g=x -> x / size(A, 1))
# Sum of squares per row
row_ss = mapreduce2d(abs2, +, A, 2)
See also: Luma.mapreduce2d! for the in-place version.
Luma.mapreduce2d! — Function
mapreduce2d!(f, op, dst, src, dim; kwargs...)
In-place GPU parallel reduction along dimension dim.
- dim=1: Column-wise reduction (vertical), dst length = number of columns
- dim=2: Row-wise reduction (horizontal), dst length = number of rows
Arguments
- f: Element-wise transformation
- op: Reduction operator
- dst: Output vector
- src: Input matrix of size (n, p)
- dim: Dimension to reduce along (1 or 2)
Keyword Arguments
- g=identity: Post-reduction transformation
- tmp=nothing: Pre-allocated temporary buffer
- FlagType=UInt8: Synchronization flag type
For dim=1 (column-wise):
- Nitem=nothing: Items per thread
- Nthreads=nothing: Threads per column reduction
- workgroup=nothing: Workgroup size
- blocks=nothing: Number of blocks
For dim=2 (row-wise):
- chunksz=nothing: Chunk size for row processing
- Nblocks=nothing: Number of blocks per row
- workgroup=nothing: Workgroup size
- blocks_row=nothing: Blocks per row
Examples
A = CUDA.rand(Float32, 1000, 500)
col_sums = CUDA.zeros(Float32, 500)
row_maxs = CUDA.zeros(Float32, 1000)
# Column sums
mapreduce2d!(identity, +, col_sums, A, 1)
# Row maximums
mapreduce2d!(identity, max, row_maxs, A, 2)
See also: Luma.mapreduce2d for the allocating version.
Luma.mapreduce1d — Function
mapreduce1d(f, op, src; kwargs...) -> GPU array or scalar
mapreduce1d(f, op, srcs::NTuple; kwargs...) -> GPU array or scalar
GPU parallel map-reduce operation.
Applies f to each element, reduces with op, and optionally applies g to the final result.
Arguments
- f: Map function applied to each element
- op: Associative binary reduction operator
- src or srcs: Input GPU array(s)
Keyword Arguments
- g=identity: Post-reduction transformation applied to the final result
- tmp=nothing: Pre-allocated temporary buffer
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
- blocks=100: Number of blocks
- FlagType=UInt8: Synchronization flag type
- to_cpu=false: If true, return a scalar; otherwise return a 1-element GPU array
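As an illustration of the g keyword above, a small sketch that folds a final normalization into the reduction to compute a mean (only documented keywords are used):
x = CUDA.rand(Float32, 10_000)
mean_x = mapreduce1d(identity, +, x; g = s -> s / length(x), to_cpu=true)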
Examples
# Sum of squares (returns GPU array)
x = CUDA.rand(Float32, 10_000)
result = mapreduce1d(x -> x^2, +, x)
# Sum of squares (returns scalar)
result = mapreduce1d(x -> x^2, +, x; to_cpu=true)
# Dot product of two arrays
x, y = CUDA.rand(Float32, 10_000), CUDA.rand(Float32, 10_000)
result = mapreduce1d((a, b) -> a * b, +, (x, y); to_cpu=true)
See also: Luma.mapreduce1d! for the in-place version.
Luma.mapreduce1d! — Function
mapreduce1d!(f, op, dst, src; kwargs...)
mapreduce1d!(f, op, dst, srcs::NTuple; kwargs...)
In-place GPU parallel map-reduce, writing the result to dst[1].
Arguments
- f: Map function applied to each element
- op: Associative binary reduction operator
- dst: Output array (result written to the first element)
- src or srcs: Input GPU array(s)
Keyword Arguments
- g=identity: Post-reduction transformation applied to the final result
- tmp=nothing: Pre-allocated temporary buffer
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
- blocks=100: Number of blocks
- FlagType=UInt8: Synchronization flag type
Examples
x = CUDA.rand(Float32, 10_000)
dst = CUDA.zeros(Float32, 1)
# Sum
mapreduce1d!(identity, +, dst, x)
# With pre-allocated temporary for repeated calls
tmp = Luma.get_allocation(mapreduce1d!, x)
for i in 1:100
    mapreduce1d!(identity, +, dst, x; tmp)
end
See also: Luma.mapreduce1d for the allocating version.
Scan
Luma.scan — Function
scan(f, op, src; kwargs...) -> GPU array
scan(op, src; kwargs...) -> GPU array
GPU parallel prefix scan (cumulative reduction) using a decoupled lookback algorithm.
Applies f to each element, then computes inclusive prefix scan with op.
Arguments
- f: Map function applied to each element (defaults to identity)
- op: Associative binary scan operator
- src: Input GPU array
Keyword Arguments
- tmp=nothing: Pre-allocated temporary buffer
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
- FlagType=UInt8: Synchronization flag type
Examples
# Cumulative sum
x = CUDA.rand(Float32, 10_000)
result = scan(+, x)
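# Inclusive semantics on a tiny input (illustration; CuArray is CUDA.jl's
# device array type): for [1, 2, 3] the cumulative sum is [1, 3, 6]
y = scan(+, CuArray(Float32[1, 2, 3]))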
# Cumulative sum of squares
result = scan(x -> x^2, +, x)
# With pre-allocated temporary for repeated calls
tmp = Luma.get_allocation(scan!, similar(x), x)
result = scan(+, x; tmp)
See also: Luma.scan! for the in-place version.
Luma.scan! — Function
scan!(f, op, dst, src; kwargs...)
scan!(op, dst, src; kwargs...)
In-place GPU parallel prefix scan using a decoupled lookback algorithm.
Applies f to each element, then computes inclusive prefix scan with op, writing results to dst.
Arguments
- f: Map function applied to each element (defaults to identity)
- op: Associative binary scan operator
- dst: Output array for scan results
- src: Input GPU array
Keyword Arguments
- tmp=nothing: Pre-allocated temporary buffer
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
- FlagType=UInt8: Synchronization flag type
Examples
x = CUDA.rand(Float32, 10_000)
dst = similar(x)
# Cumulative sum
scan!(+, dst, x)
# With pre-allocated temporary for repeated calls
tmp = Luma.get_allocation(scan!, dst, x)
for i in 1:100
    scan!(+, dst, x; tmp)
end
See also: Luma.scan for the allocating version.
Matrix-Vector
Luma.matvec — Function
matvec([f, op,] src::AbstractMatrix, x; kwargs...) -> dst
matvec!([f, op,] dst, src, x; kwargs...)
Generalized matrix-vector operation with customizable element-wise and reduction operations.
Computes dst[i] = g(op_j(f(src[i,j], x[j]))) for each row i, where op_j denotes reduction over columns. For standard matrix-vector multiplication, this is dst[i] = sum_j(src[i,j] * x[j]).
The allocating version matvec returns a newly allocated result vector. The in-place version matvec! writes to dst.
Arguments
- f: Binary operation applied element-wise (default: *)
- op: Reduction operation across columns (default: +)
- dst: Output vector (in-place versions only)
- src: Input matrix
- x: Input vector, or nothing for row-wise reduction of src alone
Keyword Arguments
- g=identity: Unary transformation applied to each reduced row
- tmp=nothing: Pre-allocated temporary buffer for inter-block communication
- chunksz=nothing: Elements per thread (auto-tuned if nothing)
- Nblocks=nothing: Number of thread blocks (auto-tuned if nothing)
- workgroup=nothing: Threads per block (auto-tuned if nothing)
- blocks_row=nothing: Number of blocks used to process a single row; relevant only for wide matrices (many columns, few rows) where parallelizing across columns is beneficial. Auto-tuned if nothing.
- FlagType=UInt8: Integer type for synchronization flags
Examples
A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 500)
# Standard matrix-vector multiply: y = A * x
y = matvec(A, x)
# Row-wise sum: y[i] = sum(A[i, :])
y = matvec(A, nothing)
# Row-wise maximum: y[i] = max_j(A[i, j])
y = matvec(identity, max, A, nothing)
# Softmax numerator: y[i] = sum_j(exp(A[i,j] - x[j]))
y = matvec((a, b) -> exp(a - b), +, A, x)
# In-place version
dst = CUDA.zeros(Float32, 1000)
matvec!(dst, A, x)
Extended Help
For tall matrices (many rows, few columns), each row is processed by a single block. For wide matrices (few rows, many columns), multiple blocks collaborate on each row; the total number of blocks Nblocks is computed from blocks_row, and for a matrix consisting of a single long row, blocks_row equals Nblocks.
Pre-allocating tmp avoids repeated allocation when calling matvec! in a loop. With FlagType=UInt8 (default), the flag buffer must be zeroed before each call. Using FlagType=UInt64 skips this zeroing by generating a random target flag at each call; correctness holds with probability 1 - n/2^64, which is negligible for practical n. Output element type is inferred as promote_op(g, promote_op(f, eltype(src), eltype(x))).
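A sketch of the reuse pattern described above, using only the documented keywords; FlagType=UInt64 avoids re-zeroing the flag buffer between calls:
A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 500)
dst = CUDA.zeros(Float32, 1000)
for i in 1:100
    matvec!(dst, A, x; FlagType=UInt64)  # random target flag each call, no zeroing
end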
Luma.matvec! — Function
matvec!([f, op,] dst, src, x; kwargs...)
In-place version of matvec, writing the result to dst. The arguments, keyword arguments, examples, and extended help are shared with Luma.matvec above.
Luma.vecmat! — Function
vecmat!(dst, x, A; kwargs...)
vecmat!(f, op, dst, x, A; kwargs...)
GPU parallel vector-matrix multiplication: dst = g(op(f(x .* A), dims=1)).
For the standard vector-matrix product, vecmat!(dst, x, A) computes dst[j] = sum(x[i] * A[i,j]). When x = nothing, it computes column reductions: dst[j] = sum(A[i,j]).
Arguments
- f=identity: Element-wise transformation applied to x[i] * A[i,j] (or A[i,j] if x = nothing)
- op=+: Reduction operator
- dst: Output vector of length p (number of columns)
- x: Input vector of length n (number of rows), or nothing for pure column reduction
- A: Input matrix of size (n, p)
Keyword Arguments
- g=identity: Optional post-reduction transformation
- tmp=nothing: Pre-allocated temporary buffer (from get_allocation)
- Nitem=nothing: Number of items per thread (auto-selected if nothing)
- Nthreads=nothing: Number of threads per column reduction
- workgroup=nothing: Workgroup size
- blocks=nothing: Maximum number of blocks
- FlagType=UInt8: Type for synchronization flags
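Examples
A usage sketch based on the semantics above (only documented argument forms are used):
A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 1000)
dst = CUDA.zeros(Float32, 500)
# Standard vector-matrix multiply: dst[j] = sum(x[i] * A[i,j])
vecmat!(dst, x, A)
# Column sums: dst[j] = sum(A[i,j])
vecmat!(dst, nothing, A)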
Utilities
Luma.get_allocation — Function
get_allocation(::typeof(mapreduce1d!), src; blocks=100, eltype=nothing, FlagType=UInt8)
Allocate temporary buffer for mapreduce1d!. Useful for repeated reductions.
Arguments
- src or srcs: Input GPU array(s) (used for backend and default element type)
Keyword Arguments
- blocks=100: Number of blocks (must match the blocks used in mapreduce1d!)
- eltype=nothing: Element type for intermediate values. If nothing, defaults to the element type of src. For proper type inference, pass promote_op(f, T, ...).
- FlagType=UInt8: Synchronization flag type
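When the map function changes the value type, the intermediate element type can be given explicitly as noted above (a sketch; promote_op is Base.promote_op):
x = CUDA.rand(Float32, 10_000)
dst = CUDA.zeros(Float64, 1)
# The map function converts Float32 to Float64, so intermediates must be Float64
tmp = Luma.get_allocation(mapreduce1d!, x; eltype=Base.promote_op(Float64, Float32))
mapreduce1d!(Float64, +, dst, x; tmp)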
Examples
x = CUDA.rand(Float32, 10_000)
tmp = Luma.get_allocation(mapreduce1d!, x)
dst = CUDA.zeros(Float32, 1)
for i in 1:100
    mapreduce1d!(identity, +, dst, x; tmp)
end
get_allocation(::typeof(scan!), dst, src; kwargs...)
Allocate temporary buffer for scan!. Useful for repeated scans.
Arguments
dst: Output GPU array (used for element type of intermediates)src: Input GPU array (used for backend)
Keyword Arguments
Nitem=nothing: Items per thread (auto-selected if nothing)workgroup=256: Workgroup size (must match theworkgroupused inscan!)FlagType=UInt8: Synchronization flag type
Examples
x = CUDA.rand(Float32, 10_000)
dst = similar(x)
tmp = Luma.get_allocation(scan!, dst, x)
for i in 1:100
    scan!(+, dst, x; tmp)
end