API Reference

Copy

Luma.vcopy! - Function
vcopy!(dst::AbstractGPUVector, src::AbstractGPUVector; Nitem=4)

Copy src to dst using vectorized GPU memory access.

Performs a high-throughput copy by loading and storing Nitem elements per thread, reducing memory transaction overhead compared to scalar copies.

Arguments

  • dst: Destination GPU vector
  • src: Source GPU vector (must have same length as dst)
  • Nitem=4: Number of elements processed per thread. Higher values improve throughput but require length(src) to be divisible by Nitem.

Example

src = CUDA.rand(Float32, 1024)
dst = CUDA.zeros(Float32, 1024)
vcopy!(dst, src)
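
Nitem can be raised when the vector length is divisible by it; a usage sketch:

# 1024 is divisible by 8, so a wider per-thread copy is valid here
vcopy!(dst, src; Nitem=8)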

See also: Luma.setvalue!

Luma.setvalue! - Function
setvalue!(dst::AbstractGPUVector{T}, val::T; Nitem=4) where T

Fill dst with val using vectorized GPU memory access.

Performs a high-throughput fill by storing Nitem copies of val per thread, reducing memory transaction overhead compared to scalar writes.

Arguments

  • dst: Destination GPU vector
  • val: Value to fill (must match element type of dst)
  • Nitem=4: Number of elements written per thread. Higher values improve throughput but require length(dst) to be divisible by Nitem.

Example

dst = CUDA.zeros(Float32, 1024)
setvalue!(dst, 1.0f0)

See also: Luma.vcopy!

Map-Reduce

Luma.mapreduce - Function
mapreduce(f, op, src::AbstractGPUArray; dims=nothing, kwargs...) -> Array or scalar

GPU parallel map-reduce operation with optional dimension reduction.

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • src: Input GPU array

Keyword Arguments

  • dims=nothing: Dimensions to reduce over. Options:
    • nothing or : (a Colon): Reduce over all dimensions (returns a scalar or a 1-element array)
    • Int: Reduce over single dimension
    • Tuple{Int...}: Reduce over multiple dimensions (must be contiguous from start or end)
  • g=identity: Post-reduction transformation
  • init=nothing: Initial value (currently unused, for API compatibility)
  • to_cpu=false: If true and dims=nothing, return scalar; otherwise return GPU array
  • Additional kwargs passed to underlying implementations

Dimension Constraints

The dims argument must specify contiguous dimensions from either:

  • The beginning: (1,), (1,2), (1,2,3), etc.
  • The end: (n-1,n), (n,), etc. for an n-dimensional array

Examples

A = CUDA.rand(Float32, 100, 50, 20)

# Full reduction (all dimensions)
total = mapreduce(identity, +, A; to_cpu=true)

# Reduce along dim 1: (100, 50, 20) -> (50, 20)
col_sums = mapreduce(identity, +, A; dims=1)

# Reduce along dims (1,2): (100, 50, 20) -> (20,)
plane_sums = mapreduce(identity, +, A; dims=(1,2))

# Reduce along last dim: (100, 50, 20) -> (100, 50)
depth_sums = mapreduce(identity, +, A; dims=3)

# Reduce along last two dims: (100, 50, 20) -> (100,)
slice_sums = mapreduce(identity, +, A; dims=(2,3))
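
The g keyword applies a post-reduction transformation to the result; a sketch of computing a mean this way (following the g usage shown for mapreduce2d below):

# Mean over all elements
mean_val = mapreduce(identity, +, A; g = x -> x / length(A), to_cpu=true)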

See also: Luma.mapreduce!, mapreduce1d, mapreduce2d

mapreduce(f, op, srcs::NTuple{N,AbstractGPUArray}; dims=nothing, kwargs...)

Multi-array mapreduce. Only supports full reduction (dims=nothing).
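
A usage sketch, mirroring the dot-product example given for mapreduce1d:

x, y = CUDA.rand(Float32, 10_000), CUDA.rand(Float32, 10_000)
dot_xy = mapreduce((a, b) -> a * b, +, (x, y); to_cpu=true)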

Luma.mapreduce! - Function
mapreduce!(f, op, dst, src; dims=nothing, kwargs...)

In-place GPU parallel map-reduce with dimension support.

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • dst: Output array
  • src: Input GPU array

Keyword Arguments

  • dims=nothing: Dimensions to reduce over (see mapreduce for details)
  • g=identity: Post-reduction transformation
  • Additional kwargs passed to underlying implementations

Examples

A = CUDA.rand(Float32, 100, 50)
col_sums = CUDA.zeros(Float32, 50)
row_sums = CUDA.zeros(Float32, 100)

# Column sums (reduce dim 1)
mapreduce!(identity, +, col_sums, A; dims=1)

# Row sums (reduce dim 2)
mapreduce!(identity, +, row_sums, A; dims=2)
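
The g keyword can post-process each reduced value in place as well; a sketch:

# Column means: divide each column sum by the number of rows
mapreduce!(identity, +, col_sums, A; dims=1, g = x -> x / size(A, 1))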

See also: Luma.mapreduce

Luma.mapreduce2d - Function
mapreduce2d(f, op, src, dim; kwargs...) -> Vector

GPU parallel reduction along dimension dim.

  • dim=1: Column-wise reduction (vertical), output length = number of columns
  • dim=2: Row-wise reduction (horizontal), output length = number of rows

Arguments

  • f: Element-wise transformation
  • op: Reduction operator
  • src: Input matrix of size (n, p)
  • dim: Dimension to reduce along (1 or 2)

Keyword Arguments

  • g=identity: Post-reduction transformation
  • tmp=nothing: Pre-allocated temporary buffer
  • FlagType=UInt8: Synchronization flag type

For dim=1 (column-wise):

  • Nitem=nothing: Items per thread
  • Nthreads=nothing: Threads per column reduction
  • workgroup=nothing: Workgroup size
  • blocks=nothing: Number of blocks

For dim=2 (row-wise):

  • chunksz=nothing: Chunk size for row processing
  • Nblocks=nothing: Number of blocks per row
  • workgroup=nothing: Workgroup size
  • blocks_row=nothing: Blocks per row

Examples

A = CUDA.rand(Float32, 1000, 500)

# Column sums (reduce along dim=1)
col_sums = mapreduce2d(identity, +, A, 1)

# Row maximums (reduce along dim=2)
row_maxs = mapreduce2d(identity, max, A, 2)

# Column means
col_means = mapreduce2d(identity, +, A, 1; g=x -> x / size(A, 1))

# Sum of squares per row
row_ss = mapreduce2d(abs2, +, A, 2)

See also: Luma.mapreduce2d! for the in-place version.

Luma.mapreduce2d! - Function
mapreduce2d!(f, op, dst, src, dim; kwargs...)

In-place GPU parallel reduction along dimension dim.

  • dim=1: Column-wise reduction (vertical), dst length = number of columns
  • dim=2: Row-wise reduction (horizontal), dst length = number of rows

Arguments

  • f: Element-wise transformation
  • op: Reduction operator
  • dst: Output vector
  • src: Input matrix of size (n, p)
  • dim: Dimension to reduce along (1 or 2)

Keyword Arguments

  • g=identity: Post-reduction transformation
  • tmp=nothing: Pre-allocated temporary buffer
  • FlagType=UInt8: Synchronization flag type

For dim=1 (column-wise):

  • Nitem=nothing: Items per thread
  • Nthreads=nothing: Threads per column reduction
  • workgroup=nothing: Workgroup size
  • blocks=nothing: Number of blocks

For dim=2 (row-wise):

  • chunksz=nothing: Chunk size for row processing
  • Nblocks=nothing: Number of blocks per row
  • workgroup=nothing: Workgroup size
  • blocks_row=nothing: Blocks per row

Examples

A = CUDA.rand(Float32, 1000, 500)
col_sums = CUDA.zeros(Float32, 500)
row_maxs = CUDA.zeros(Float32, 1000)

# Column sums
mapreduce2d!(identity, +, col_sums, A, 1)

# Row maximums
mapreduce2d!(identity, max, row_maxs, A, 2)

See also: Luma.mapreduce2d for the allocating version.

Luma.mapreduce1d - Function
mapreduce1d(f, op, src; kwargs...) -> GPU array or scalar
mapreduce1d(f, op, srcs::NTuple; kwargs...) -> GPU array or scalar

GPU parallel map-reduce operation.

Applies f to each element, reduces with op, and optionally applies g to the final result.

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • src or srcs: Input GPU array(s)

Keyword Arguments

  • g=identity: Post-reduction transformation applied to final result
  • tmp=nothing: Pre-allocated temporary buffer
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • blocks=100: Number of blocks
  • FlagType=UInt8: Synchronization flag type
  • to_cpu=false: If true, return scalar; otherwise return 1-element GPU array

Examples

# Sum of squares (returns GPU array)
x = CUDA.rand(Float32, 10_000)
result = mapreduce1d(x -> x^2, +, x)

# Sum of squares (returns scalar)
result = mapreduce1d(x -> x^2, +, x; to_cpu=true)

# Dot product of two arrays
x, y = CUDA.rand(Float32, 10_000), CUDA.rand(Float32, 10_000)
result = mapreduce1d((a, b) -> a * b, +, (x, y); to_cpu=true)

See also: Luma.mapreduce1d! for the in-place version.

Luma.mapreduce1d! - Function
mapreduce1d!(f, op, dst, src; kwargs...)
mapreduce1d!(f, op, dst, srcs::NTuple; kwargs...)

In-place GPU parallel map-reduce, writing result to dst[1].

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • dst: Output array (result written to first element)
  • src or srcs: Input GPU array(s)

Keyword Arguments

  • g=identity: Post-reduction transformation applied to final result
  • tmp=nothing: Pre-allocated temporary buffer
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • blocks=100: Number of blocks
  • FlagType=UInt8: Synchronization flag type

Examples

x = CUDA.rand(Float32, 10_000)
dst = CUDA.zeros(Float32, 1)

# Sum
mapreduce1d!(identity, +, dst, x)

# With pre-allocated temporary for repeated calls
tmp = Luma.get_allocation(mapreduce1d!, x)
for i in 1:100
    mapreduce1d!(identity, +, dst, x; tmp)
end

See also: Luma.mapreduce1d for the allocating version.

Scan

Luma.scan - Function
scan(f, op, src; kwargs...) -> GPU array
scan(op, src; kwargs...) -> GPU array

GPU parallel prefix scan (cumulative reduction) using a decoupled lookback algorithm.

Applies f to each element, then computes inclusive prefix scan with op.

Arguments

  • f: Map function applied to each element (defaults to identity)
  • op: Associative binary scan operator
  • src: Input GPU array

Keyword Arguments

  • tmp=nothing: Pre-allocated temporary buffer
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • FlagType=UInt8: Synchronization flag type

Examples

# Cumulative sum
x = CUDA.rand(Float32, 10_000)
result = scan(+, x)

# Cumulative sum of squares
result = scan(x -> x^2, +, x)

# With pre-allocated temporary for repeated calls
tmp = Luma.get_allocation(scan!, similar(x), x)
result = scan(+, x; tmp)

See also: Luma.scan! for the in-place version.

Luma.scan! - Function
scan!(f, op, dst, src; kwargs...)
scan!(op, dst, src; kwargs...)

In-place GPU parallel prefix scan using a decoupled lookback algorithm.

Applies f to each element, then computes inclusive prefix scan with op, writing results to dst.

Arguments

  • f: Map function applied to each element (defaults to identity)
  • op: Associative binary scan operator
  • dst: Output array for scan results
  • src: Input GPU array

Keyword Arguments

  • tmp=nothing: Pre-allocated temporary buffer
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • FlagType=UInt8: Synchronization flag type

Examples

x = CUDA.rand(Float32, 10_000)
dst = similar(x)

# Cumulative sum
scan!(+, dst, x)

# With pre-allocated temporary for repeated calls
tmp = Luma.get_allocation(scan!, dst, x)
for i in 1:100
    scan!(+, dst, x; tmp)
end

See also: Luma.scan for the allocating version.

Matrix-Vector

Luma.matvec - Function
matvec([f, op,] src::AbstractMatrix, x; kwargs...) -> dst
matvec!([f, op,] dst, src, x; kwargs...)

Generalized matrix-vector operation with customizable element-wise and reduction operations.

Computes dst[i] = g(op_j(f(src[i,j], x[j]))) for each row i, where op_j denotes reduction over columns. For standard matrix-vector multiplication, this is dst[i] = sum_j(src[i,j] * x[j]).

The allocating version matvec returns a newly allocated result vector. The in-place version matvec! writes to dst.

Arguments

  • f: Binary operation applied element-wise (default: *)
  • op: Reduction operation across columns (default: +)
  • dst: Output vector (in-place versions only)
  • src: Input matrix
  • x: Input vector, or nothing for row-wise reduction of src alone

Keyword Arguments

  • g=identity: Unary transformation applied to each reduced row
  • tmp=nothing: Pre-allocated temporary buffer for inter-block communication
  • chunksz=nothing: Elements per thread (auto-tuned if nothing)
  • Nblocks=nothing: Number of thread blocks (auto-tuned if nothing)
  • workgroup=nothing: Threads per block (auto-tuned if nothing)
  • blocks_row=nothing: Number of blocks used to process a single row; relevant only for wide matrices (many columns, few rows) where parallelizing across columns is beneficial. Auto-tuned if nothing.
  • FlagType=UInt8: Integer type for synchronization flags

Examples

A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 500)

# Standard matrix-vector multiply: y = A * x
y = matvec(A, x)

# Row-wise sum: y[i] = sum(A[i, :])
y = matvec(A, nothing)

# Row-wise maximum: y[i] = max_j(A[i, j])
y = matvec(identity, max, A, nothing)

# Softmax numerator: y[i] = sum_j(exp(A[i,j] - x[j]))
y = matvec((a, b) -> exp(a - b), +, A, x)

# In-place version
dst = CUDA.zeros(Float32, 1000)
matvec!(dst, A, x)

Extended Help

For tall matrices (many rows, few columns), each row is processed by a single block. For wide matrices (few rows, many columns), multiple blocks collaborate on each row; the total number of blocks Nblocks is computed from blocks_row, and for a matrix consisting of a single large row, blocks_row equals Nblocks.

Pre-allocating tmp avoids repeated allocation when calling matvec! in a loop. With FlagType=UInt8 (the default), the flag buffer must be zeroed before each call. Using FlagType=UInt64 skips this zeroing by generating a random target flag on each call; the failure probability is n/2^64, which is negligible for practical n. The output element type is inferred as promote_op(g, promote_op(f, eltype(src), eltype(x))).
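
For example, a repeated in-place call that skips the per-call flag zeroing (a minimal sketch of the trade-off described above, reusing dst, A, and x from the Examples):

for i in 1:100
    # UInt64 flags avoid re-zeroing the flag buffer before each call
    matvec!(dst, A, x; FlagType=UInt64)
end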

Luma.matvec! - Function
matvec!([f, op,] dst, src, x; kwargs...)

In-place version of Luma.matvec: writes the result to dst instead of allocating it. The two functions share a docstring; see Luma.matvec above for the full description of arguments, keyword arguments, examples, and extended help.
Luma.vecmat! - Function
vecmat!(dst, x, A; kwargs...)
vecmat!(f, op, dst, x, A; kwargs...)

GPU parallel vector-matrix multiplication: dst = g(op(f(x .* A), dims=1)).

For the standard vector-matrix product, vecmat!(dst, x, A) computes dst[j] = sum_i(x[i] * A[i,j]). When x = nothing, it computes column reductions: dst[j] = sum_i(A[i,j]).

Arguments

  • f=identity: Element-wise transformation applied to x[i] * A[i,j] (or A[i,j] if x=nothing)
  • op=+: Reduction operator
  • dst: Output vector of length p (number of columns)
  • x: Input vector of length n (number of rows), or nothing for pure column reduction
  • A: Input matrix of size (n, p)

Keyword Arguments

  • g=identity: Optional post-reduction transformation
  • tmp=nothing: Pre-allocated temporary buffer (from get_allocation)
  • Nitem=nothing: Number of items per thread (auto-selected if nothing)
  • Nthreads=nothing: Number of threads per column reduction
  • workgroup=nothing: Workgroup size
  • blocks=nothing: Maximum number of blocks
  • FlagType=UInt8: Type for synchronization flags
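
Examples

A usage sketch based on the operations described above (allocations assumed to be CUDA arrays, as in the other examples):

A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 1000)
dst = CUDA.zeros(Float32, 500)

# Vector-matrix product: dst[j] = sum_i x[i] * A[i, j]
vecmat!(dst, x, A)

# Column sums: dst[j] = sum_i A[i, j]
vecmat!(dst, nothing, A)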

Utilities

Luma.get_allocation - Function
get_allocation(::typeof(mapreduce1d!), src; blocks=100, eltype=nothing, FlagType=UInt8)

Allocate temporary buffer for mapreduce1d!. Useful for repeated reductions.

Arguments

  • src or srcs: Input GPU array(s) (used for backend and default element type)

Keyword Arguments

  • blocks=100: Number of blocks (must match the blocks used in mapreduce1d!)
  • eltype=nothing: Element type for intermediate values. If nothing, defaults to the element type of src. For proper type inference, pass promote_op(f, T, ...).
  • FlagType=UInt8: Synchronization flag type

Examples

x = CUDA.rand(Float32, 10_000)
tmp = Luma.get_allocation(mapreduce1d!, x)
dst = CUDA.zeros(Float32, 1)

for i in 1:100
    mapreduce1d!(identity, +, dst, x; tmp)
end
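
When f changes the element type, eltype should reflect the promoted type; a sketch using Base.promote_op as suggested above:

# f promotes Float32 inputs to Float64, so intermediates must be Float64
f64 = v -> Float64(v)^2
tmp64 = Luma.get_allocation(mapreduce1d!, x; eltype=Base.promote_op(f64, Float32))
dst64 = CUDA.zeros(Float64, 1)
mapreduce1d!(f64, +, dst64, x; tmp=tmp64)
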
get_allocation(::typeof(scan!), dst, src; kwargs...)

Allocate temporary buffer for scan!. Useful for repeated scans.

Arguments

  • dst: Output GPU array (used for element type of intermediates)
  • src: Input GPU array (used for backend)

Keyword Arguments

  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size (must match the workgroup used in scan!)
  • FlagType=UInt8: Synchronization flag type

Examples

x = CUDA.rand(Float32, 10_000)
dst = similar(x)
tmp = Luma.get_allocation(scan!, dst, x)

for i in 1:100
    scan!(+, dst, x; tmp)
end