API Reference

Copy

KernelForge.vcopy! — Function
vcopy!(dst::AbstractGPUVector, src::AbstractGPUVector; Nitem=4)

Copy src to dst using vectorized GPU memory access.

Performs a high-throughput copy by loading and storing Nitem elements per thread, reducing memory transaction overhead compared to scalar copies.
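The access pattern can be pictured with a plain-Julia CPU sketch (hypothetical helper vcopy_sketch!, not the actual GPU kernel): each simulated thread moves one contiguous block of Nitem elements.

```julia
# CPU sketch of the vectorized-copy access pattern (not the real kernel):
# "thread" t handles elements (t-1)*Nitem+1 : t*Nitem as one contiguous block.
function vcopy_sketch!(dst, src; Nitem=4)
    @assert length(dst) == length(src)
    @assert length(src) % Nitem == 0   # same divisibility requirement as vcopy!
    for t in 1:(length(src) ÷ Nitem)
        lo = (t - 1) * Nitem + 1
        @views dst[lo:lo+Nitem-1] .= src[lo:lo+Nitem-1]  # one "vectorized" transaction
    end
    return dst
end
```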

Arguments

  • dst: Destination GPU vector
  • src: Source GPU vector (must have same length as dst)
  • Nitem=4: Number of elements processed per thread. Higher values improve throughput but require length(src) to be divisible by Nitem.

Example

src = CUDA.rand(Float32, 1024)
dst = CUDA.zeros(Float32, 1024)
vcopy!(dst, src)

See also: KernelForge.setvalue!

KernelForge.setvalue! — Function
setvalue!(dst::AbstractGPUVector{T}, val::T; Nitem=4) where T

Fill dst with val using vectorized GPU memory access.

Performs a high-throughput fill by storing Nitem copies of val per thread, reducing memory transaction overhead compared to scalar writes.

Arguments

  • dst: Destination GPU vector
  • val: Value to fill (must match element type of dst)
  • Nitem=4: Number of elements written per thread. Higher values improve throughput but require length(dst) to be divisible by Nitem.

Example

dst = CUDA.zeros(Float32, 1024)
setvalue!(dst, 1.0f0)

See also: KernelForge.vcopy!

Map-Reduce

KernelForge.mapreduce — Function
mapreduce(f, op, src::AbstractArray; dims=nothing, kwargs...) -> scalar or GPU array

GPU parallel map-reduce operation with optional dimension reduction.

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • src: Input GPU array

Keyword Arguments

  • dims=nothing: Dimensions to reduce over. Options:
    • nothing or : (Colon): Reduce over all dimensions → scalar
    • Int or tuple of Ints: Reduce over those dims → GPU array
  • g=identity: Post-reduction transformation
  • tmp=nothing: Pre-allocated buffer — a MapReduceBuffer (full reduction), VecMatBuffer (dim=1 reduction), or MatVecBuffer (dim=2 reduction)
  • Additional kwargs passed to underlying implementations

Fast paths

  • Full reduction (dims=nothing) → mapreduce1d
  • All dims explicit → mapreducedims
  • Contiguous leading dims (1,...,k) → reshape, mapreduce2d on dim 1, reshape back
  • Contiguous trailing dims (k,...,n) → reshape, mapreduce2d on dim 2, reshape back
  • Both leading and trailing contiguous blocks → two mapreduce2d passes
  • General dims → mapreducedims fallback
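The reshape-based fast paths rest on a column-major identity: reducing a contiguous block of leading dimensions is the same as collapsing those dimensions into one axis and reducing dimension 1 of the resulting matrix. A plain-Julia check of the leading-dims case:

```julia
# Reducing contiguous leading dims (1, 2) of a (4, 5, 6) array equals
# reducing dim 1 of the same data reshaped to a (4*5, 6) matrix.
A = rand(Float32, 4, 5, 6)
direct   = sum(A; dims=(1, 2))                # shape (1, 1, 6)
reshaped = sum(reshape(A, 4 * 5, 6); dims=1)  # shape (1, 6)
@assert vec(direct) ≈ vec(reshaped)
```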

Examples

A = CUDA.rand(Float32, 100, 50, 20)

# Full reduction → scalar
total = mapreduce(identity, +, A)

# Reduce along dim 1: (100, 50, 20) -> (1, 50, 20)
col_sums = mapreduce(identity, +, A; dims=1)

# Reduce along dims (1,2): (100, 50, 20) -> (1, 1, 20)
plane_sums = mapreduce(identity, +, A; dims=(1,2))

# Reduce along last dim: (100, 50, 20) -> (100, 50, 1)
depth_sums = mapreduce(identity, +, A; dims=3)

See also: KernelForge.mapreduce!, mapreduce1d, mapreduce2d, mapreducedims

mapreduce(f, op, srcs::NTuple; kwargs...)

Multi-array mapreduce. Only supports full reduction (dims=nothing).

Keyword Arguments

  • g=identity: Post-reduction transformation
  • tmp=nothing: Pre-allocated MapReduceBuffer
  • Additional kwargs passed to mapreduce1d

KernelForge.mapreduce! — Function
mapreduce!(f, op, dst, src; dims=nothing, kwargs...)

In-place GPU parallel map-reduce with dimension support.

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • dst: Output array
  • src: Input GPU array

Keyword Arguments

  • dims=nothing: Dimensions to reduce over. Options:
    • nothing or : (Colon): Reduce over all dimensions → writes to dst[1]
    • Int or tuple of Ints: Reduce over those dims → writes to dst
  • g=identity: Post-reduction transformation
  • tmp=nothing: Pre-allocated buffer — a MapReduceBuffer (full reduction), VecMatBuffer (dim=1 reduction), or MatVecBuffer (dim=2 reduction)
  • Additional kwargs passed to underlying implementations

Examples

A = CUDA.rand(Float32, 100, 50)
dst = CUDA.zeros(Float32, 1, 50)
mapreduce!(identity, +, dst, A; dims=1)

See also: KernelForge.mapreduce

KernelForge.mapreduce2d — Function
mapreduce2d(f, op, src, dim; kwargs...) -> Vector

GPU parallel reduction along dimension dim.

  • dim=1: Column-wise reduction (vertical), output length = number of columns
  • dim=2: Row-wise reduction (horizontal), output length = number of rows
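In Base-Julia terms (a semantic reference, not the GPU implementation), the two cases correspond to:

```julia
A = rand(Float32, 8, 3)
# dim=1 (column-wise): one result per column
@assert vec(sum(A; dims=1)) ≈ [sum(A[:, j]) for j in 1:3]
# dim=2 (row-wise): one result per row
@assert vec(sum(A; dims=2)) ≈ [sum(A[i, :]) for i in 1:8]
```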

Arguments

  • f: Element-wise transformation
  • op: Reduction operator
  • src: Input matrix of size (n, p)
  • dim: Dimension to reduce along (1 or 2)

Keyword Arguments

  • g=identity: Post-reduction transformation
  • tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)

For dim=1 (column-wise):

  • Nitem=nothing: Items per thread
  • Nthreads=nothing: Threads per column reduction
  • workgroup=nothing: Workgroup size
  • blocks=nothing: Number of blocks

For dim=2 (row-wise):

  • chunksz=nothing: Chunk size for row processing
  • Nblocks=nothing: Number of blocks per row
  • workgroup=nothing: Workgroup size
  • blocks_row=nothing: Blocks per row

Examples

A = CUDA.rand(Float32, 1000, 500)

# Column sums (reduce along dim=1)
col_sums = mapreduce2d(identity, +, A, 1)

# Row maximums (reduce along dim=2)
row_maxs = mapreduce2d(identity, max, A, 2)

# Column means
col_means = mapreduce2d(identity, +, A, 1; g=x -> x / size(A, 1))

# Sum of squares per row
row_ss = mapreduce2d(abs2, +, A, 2)

See also: KernelForge.mapreduce2d! for the in-place version.

KernelForge.mapreduce2d! — Function
mapreduce2d!(f, op, dst, src, dim; kwargs...)

In-place GPU parallel reduction along dimension dim.

  • dim=1: Column-wise reduction (vertical), dst length = number of columns
  • dim=2: Row-wise reduction (horizontal), dst length = number of rows

Arguments

  • f: Element-wise transformation
  • op: Reduction operator
  • dst: Output vector
  • src: Input matrix of size (n, p)
  • dim: Dimension to reduce along (1 or 2)

Keyword Arguments

  • g=identity: Post-reduction transformation
  • tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)

For dim=1 (column-wise):

  • Nitem=nothing: Items per thread
  • Nthreads=nothing: Threads per column reduction
  • workgroup=nothing: Workgroup size
  • blocks=nothing: Number of blocks

For dim=2 (row-wise):

  • chunksz=nothing: Chunk size for row processing
  • Nblocks=nothing: Number of blocks per row
  • workgroup=nothing: Workgroup size
  • blocks_row=nothing: Blocks per row

Examples

A = CUDA.rand(Float32, 1000, 500)
col_sums = CUDA.zeros(Float32, 500)
row_maxs = CUDA.zeros(Float32, 1000)

# Column sums
mapreduce2d!(identity, +, col_sums, A, 1)

# Row maximums
mapreduce2d!(identity, max, row_maxs, A, 2)

See also: KernelForge.mapreduce2d for the allocating version.

KernelForge.mapreduce1d — Function
mapreduce1d(f, op, src; kwargs...) -> scalar or GPU array
mapreduce1d(f, op, srcs::NTuple; kwargs...) -> scalar or GPU array

GPU parallel map-reduce operation.

Applies f to each element, reduces with op, and optionally applies g to the final result.
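Semantically the pipeline is g(reduce(op, map(f, src))); a plain-Julia reference using Base.mapreduce:

```julia
x = Float32.(1:10)
# mapreduce1d(abs2, +, x; g = sqrt) would compute sqrt(sum(x .^ 2)) on the GPU;
# the CPU reference is:
result = sqrt(mapreduce(abs2, +, x))
@assert result ≈ sqrt(385f0)   # 1 + 4 + … + 100 = 385
```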

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • src or srcs: Input GPU array(s)

Keyword Arguments

  • g=identity: Post-reduction transformation applied to final result
  • tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • blocks=100: Number of blocks
  • to_cpu=true: If true, return scalar; otherwise return 1-element GPU array

Examples

# Sum of squares
x = CUDA.rand(Float32, 10_000)
result = mapreduce1d(x -> x^2, +, x)

# Return GPU array instead of scalar
result = mapreduce1d(x -> x^2, +, x; to_cpu=false)

# Dot product of two arrays
x, y = CUDA.rand(Float32, 10_000), CUDA.rand(Float32, 10_000)
result = mapreduce1d((a, b) -> a * b, +, (x, y))

See also: KernelForge.mapreduce1d! for the in-place version.

KernelForge.mapreduce1d! — Function
mapreduce1d!(f, op, dst, src; kwargs...)
mapreduce1d!(f, op, dst, srcs::NTuple; kwargs...)

In-place GPU parallel map-reduce, writing result to dst[1].

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • dst: Output array (result written to first element)
  • src or srcs: Input GPU array(s)

Keyword Arguments

  • g=identity: Post-reduction transformation applied to final result
  • tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • blocks=100: Number of blocks

Examples

x = CUDA.rand(Float32, 10_000)
dst = CUDA.zeros(Float32, 1)

# Sum
mapreduce1d!(identity, +, dst, x)

# With pre-allocated buffer for repeated calls
tmp = KernelForge.get_allocation(MapReduce1D, x -> x^2, +, x)
for i in 1:100
    mapreduce1d!(x -> x^2, +, dst, x; tmp)
end

See also: KernelForge.mapreduce1d for the allocating version.

KernelForge.mapreducedims — Function
mapreducedims(f, op, src, dims; kwargs...) -> GPU array

GPU parallel map-reduce over specified dimensions.

Applies f to each element, reduces along dims with op, and optionally applies g to each final element.
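The output keeps a size-1 axis for every reduced dimension, matching Base's dims-aware mapreduce; a plain-Julia reference:

```julia
x = rand(Float32, 4, 8, 16)
r = mapreduce(identity, +, x; dims=(1, 3))   # Base reference for mapreducedims
@assert size(r) == (1, 8, 1)                 # reduced dims collapse to size 1
@assert vec(r) ≈ [sum(x[:, j, :]) for j in 1:8]
```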

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • src: Input GPU array
  • dims: Dimension(s) to reduce over (Int or tuple of Ints)

Keyword Arguments

  • g=identity: Post-reduction transformation applied to each result element
  • workgroup=256: Workgroup size

Examples

# Column sums (reduce along dim 1)
x = CUDA.rand(Float32, 128, 64)
result = mapreducedims(identity, +, x, 1)   # shape: (1, 64)

# Row-wise sum of squares (reduce along dim 2)
result = mapreducedims(x -> x^2, +, x, 2)  # shape: (128, 1)

# Reduce multiple dimensions
x = CUDA.rand(Float32, 4, 8, 16)
result = mapreducedims(identity, +, x, (1, 3))  # shape: (1, 8, 1)

See also: KernelForge.mapreducedims! for the in-place version.

KernelForge.mapreducedims! — Function
mapreducedims!(f, op, dst, src, dims; kwargs...)

In-place GPU parallel map-reduce over specified dimensions, writing result to dst.

Arguments

  • f: Map function applied to each element
  • op: Associative binary reduction operator
  • dst: Output array (must have size 1 along each reduced dimension)
  • src: Input GPU array
  • dims: Dimension(s) to reduce over (Int or tuple of Ints)

Keyword Arguments

  • g=identity: Post-reduction transformation applied to each result element
  • workgroup=256: Workgroup size

Examples

x = CUDA.rand(Float32, 128, 64)
dst = CUDA.zeros(Float32, 1, 64)

# Sum along dim 1
mapreducedims!(identity, +, dst, x, 1)

# Sum of squares along dim 2 with pre-allocated dst
dst2 = CUDA.zeros(Float32, 128, 1)
mapreducedims!(x -> x^2, +, dst2, x, 2)

See also: KernelForge.mapreducedims for the allocating version.

Scan

KernelForge.scan — Function
scan(f, op, src; kwargs...) -> GPU array
scan(op, src; kwargs...) -> GPU array

GPU parallel prefix scan (cumulative reduction) using a decoupled lookback algorithm.

Applies f to each element, computes inclusive prefix scan with op, and optionally applies g to each output element.
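A plain-Julia reference for the semantics (not the lookback kernel itself): the result equals g.(accumulate(op, f.(src))).

```julia
x = Float32.(1:5)
# scan(abs2, +, x; g = sqrt) corresponds to:
ref = sqrt.(accumulate(+, abs2.(x)))   # inclusive prefix scan of f.(x), then g
@assert ref ≈ sqrt.(cumsum(x .^ 2))
@assert ref[end] ≈ sqrt(55f0)          # 1 + 4 + 9 + 16 + 25 = 55
```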

Arguments

  • f: Map function applied to each element (defaults to identity)
  • op: Associative binary scan operator
  • src: Input GPU array

Keyword Arguments

  • g=identity: Post-scan transformation applied to each output element
  • tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size

Examples

# Cumulative sum
x = CUDA.rand(Float32, 10_000)
result = scan(+, x)

# Cumulative sum of squares
result = scan(x -> x^2, +, x)

# With post-scan transformation
result = scan(+, x; g = sqrt)

# With pre-allocated buffer for repeated calls
tmp = KernelForge.get_allocation(Scan1D, identity, +, x)
result = scan(+, x; tmp)

See also: KernelForge.scan! for the in-place version.

KernelForge.scan! — Function
scan!(f, op, dst, src; kwargs...)
scan!(op, dst, src; kwargs...)

In-place GPU parallel prefix scan using a decoupled lookback algorithm.

Applies f to each element, computes inclusive prefix scan with op, and optionally applies g to each output element, writing results to dst.

Arguments

  • f: Map function applied to each element (defaults to identity)
  • op: Associative binary scan operator
  • dst: Output array for scan results
  • src: Input GPU array

Keyword Arguments

  • g=identity: Post-scan transformation applied to each output element
  • tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size

Examples

x = CUDA.rand(Float32, 10_000)
dst = similar(x)

# Cumulative sum
scan!(+, dst, x)

# With pre-allocated buffer for repeated calls
tmp = KernelForge.get_allocation(Scan1D, identity, +, x)
for i in 1:100
    scan!(+, dst, x; tmp)
end

See also: KernelForge.scan for the allocating version.

KernelForge.findfirst — Function
findfirst(filtr, src; kwargs...) -> Int or CartesianIndex or nothing

GPU parallel findfirst. Returns the index of the first element in src for which filtr returns true, or nothing if no such element exists. For multidimensional arrays, returns a CartesianIndex.

Arguments

  • filtr: Predicate function
  • src: Input GPU array

Keyword Arguments

  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • blocks=100: Number of blocks

Examples

x = adapt(backend, rand(Float32, 10_000))   # `backend` is your GPU array type, e.g. CuArray (via Adapt.jl)
findfirst(>(0.99f0), x)       # returns a linear index or nothing
A = adapt(backend, rand(Float32, 100, 100))
findfirst(>(0.99f0), A)       # returns a CartesianIndex or nothing

See also: KernelForge.findlast.

KernelForge.findlast — Function
findlast(filtr, src; kwargs...) -> Int or CartesianIndex or nothing

GPU parallel findlast. Returns the index of the last element in src for which filtr returns true, or nothing if no such element exists. Implemented by reversing src and delegating to KernelForge.findfirst, so it accepts the same keyword arguments. For multidimensional arrays, returns a CartesianIndex.
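The index arithmetic behind the reverse-and-delegate strategy, sketched in plain Julia:

```julia
x = [0, 3, 0, 7, 0]
p = !iszero
# findlast(p, x) via findfirst on the reversed array:
k = findfirst(p, reverse(x))                       # position in the reversed array
idx = k === nothing ? nothing : length(x) - k + 1  # map back to the original array
@assert idx == findlast(p, x) == 4
```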

Arguments

  • filtr: Predicate function
  • src: Input GPU array

Keyword Arguments

  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • blocks=100: Number of blocks

Examples

x = adapt(backend, rand(Float32, 10_000))   # `backend` is your GPU array type, e.g. CuArray (via Adapt.jl)
findlast(>(0.99f0), x)        # returns a linear index or nothing
A = adapt(backend, rand(Float32, 100, 100))
findlast(>(0.99f0), A)        # returns a CartesianIndex or nothing

See also: KernelForge.findfirst.

KernelForge.argmax1d — Function
argmax1d(f, rel, src; kwargs...) -> Int or GPU array
argmax1d(f, rel, srcs::NTuple; kwargs...) -> Int or GPU array

GPU parallel argmax/argmin operation.

Applies f to each element, finds the extremum according to rel, and returns the index of the first extremal element. Ties are broken by smallest index.
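A plain-Julia reference for the semantics, including the smallest-index tie-break (hypothetical helper argmax1d_ref, not the GPU kernel):

```julia
function argmax1d_ref(f, rel, x)
    best, besti = f(x[1]), 1
    for i in 2:length(x)
        v = f(x[i])
        if rel(v, best)            # strict relation ⇒ ties keep the earlier index
            best, besti = v, i
        end
    end
    return besti
end

@assert argmax1d_ref(identity, >, [3, 9, 9, 1]) == 2   # tie broken toward smaller index
@assert argmax1d_ref(abs, <, [-5, 2, -2, 7]) == 2      # argmin of |x|
```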

Arguments

  • f: Map function applied to each element
  • rel: Comparison relation (> for argmax, < for argmin)
  • src or srcs: Input GPU array(s)

Keyword Arguments

  • tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • blocks=100: Number of blocks
  • to_cpu=true: If true, return scalar Int; otherwise return 1-element GPU array

Examples

x = CUDA.rand(Float32, 10_000)

# Argmax returning scalar index
idx = argmax1d(identity, >, x)

# Argmax returning 1-element GPU array
idx_gpu = argmax1d(identity, >, x; to_cpu=false)

# Argmin of absolute values
idx = argmax1d(abs, <, x)

See also: KernelForge.argmax1d! for the in-place version.

KernelForge.argmax1d! — Function
argmax1d!(f, rel, dst, src; kwargs...)
argmax1d!(f, rel, dst, srcs::NTuple; kwargs...)

In-place GPU parallel argmax/argmin, writing the index to dst[1].

Ties are broken by smallest index.

Arguments

  • f: Map function applied to each element
  • rel: Comparison relation (> for argmax, < for argmin)
  • dst: Output array (index written to first element)
  • src or srcs: Input GPU array(s)

Keyword Arguments

  • tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
  • Nitem=nothing: Items per thread (auto-selected if nothing)
  • workgroup=256: Workgroup size
  • blocks=100: Number of blocks

Examples

x = CUDA.rand(Float32, 10_000)
dst = CUDA.zeros(Int, 1)

# Argmax index
argmax1d!(identity, >, dst, x)

# With pre-allocated temporary for repeated calls
tmp = KernelForge.get_allocation(Argmax1D, identity, x)
for i in 1:100
    argmax1d!(identity, >, dst, x; tmp)
end

See also: KernelForge.argmax1d for the allocating version.

KernelForge.argmax — Function
argmax(rel, src::AbstractArray)

GPU parallel search returning the (value, index) pair of the element that is extremal according to the relation rel. The index agrees with argmax1d(identity, rel, src), which returns the index alone.

Arguments

  • rel: Comparison relation (e.g. > for maximum, < for minimum)
  • src: Input GPU array

Examples

x = CuArray([3f0, 1f0, 4f0, 1f0, 5f0])
argmax(>, x)  # returns (5f0, 5)
argmax(<, x)  # returns (1f0, 2)

See also: KernelForge.argmax1d, KernelForge.argmin.

argmax(src::AbstractArray)

GPU parallel argmax returning the (value, index) pair of the maximum element. Equivalent to argmax(>, src).

Examples

x = CuArray([3f0, 1f0, 4f0, 1f0, 5f0])
argmax(x)  # returns (5f0, 5)

See also: KernelForge.argmin, KernelForge.argmax1d.

Matrix-Vector

KernelForge.matvec — Function
matvec([f, op,] src::AbstractMatrix, x; kwargs...) -> GPU array
matvec!([f, op,] dst, src, x; kwargs...)

Generalized matrix-vector operation with customizable element-wise and reduction operations.

Computes dst[i] = g(op_j(f(src[i,j], x[j]))) for each row i, where op_j denotes reduction over columns. For standard matrix-vector multiplication, this is dst[i] = sum_j(src[i,j] * x[j]).
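A plain-Julia reference for the generalized formula (hypothetical helper matvec_ref, not the GPU implementation):

```julia
# dst[i] = g(op-reduction over j of f(A[i, j], x[j]))
matvec_ref(f, op, g, A, x) =
    [g(reduce(op, (f(A[i, j], x[j]) for j in axes(A, 2)))) for i in axes(A, 1)]

A = Float32[1 2; 3 4]
x = Float32[10, 100]
@assert matvec_ref(*, +, identity, A, x) ≈ A * x        # standard matvec
@assert matvec_ref(*, +, identity, A, x) ≈ [210f0, 430f0]
```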

The allocating version matvec returns a newly allocated result vector. The in-place version matvec! writes to dst.

Arguments

  • f: Binary operation applied element-wise (default: *)
  • op: Reduction operation across columns (default: +)
  • dst: Output vector (in-place versions only)
  • src: Input matrix
  • x: Input vector, or nothing for row-wise reduction of src alone

Keyword Arguments

  • g=identity: Unary transformation applied to each reduced row
  • tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
  • chunksz=nothing: Elements per thread (auto-tuned if nothing)
  • Nblocks=nothing: Number of thread blocks (auto-tuned if nothing)
  • workgroup=nothing: Threads per block (auto-tuned if nothing)
  • blocks_row=nothing: Number of blocks used to process a single row; relevant only for wide matrices (many columns, few rows) where parallelizing across columns is beneficial. Auto-tuned if nothing.

Examples

A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 500)

# Standard matrix-vector multiply: y = A * x
y = matvec(A, x)

# Row-wise sum: y[i] = sum(A[i, :])
y = matvec(A, nothing)

# Row-wise maximum: y[i] = max_j(A[i, j])
y = matvec(identity, max, A, nothing)

# Softmax numerator: y[i] = sum_j(exp(A[i,j] - x[j]))
y = matvec((a, b) -> exp(a - b), +, A, x)

# In-place version
dst = CUDA.zeros(Float32, 1000)
matvec!(dst, A, x)

# With pre-allocated buffer for repeated calls
tmp = KernelForge.get_allocation(MatVec, *, +, A, x)
for i in 1:100
    matvec!(dst, A, x; tmp)
end

KernelForge.matvec! — Function
matvec!([f, op,] dst, src, x; kwargs...)

In-place version of matvec, writing the result to dst. matvec and matvec! share the same documentation; see KernelForge.matvec above for the full description, arguments, keyword arguments, and examples.

KernelForge.vecmat! — Function
vecmat([f, op,] x, src; kwargs...) -> GPU array
vecmat!([f, op,] dst, x, src; kwargs...)

GPU parallel vector-matrix multiplication: dst[j] = g(op_i(f(x[i], src[i,j]))), reducing over rows i for each column j.

For the standard vector-matrix product, vecmat!(dst, x, A) computes dst[j] = sum_i(x[i] * A[i,j]). When x = nothing, it computes column reductions: dst[j] = sum_i(A[i,j]).
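A plain-Julia reference for the column-wise formula (hypothetical helper vecmat_ref, not the GPU implementation):

```julia
# dst[j] = g(op-reduction over i of f(x[i], A[i, j]))
vecmat_ref(f, op, g, x, A) =
    [g(reduce(op, (f(x[i], A[i, j]) for i in axes(A, 1)))) for j in axes(A, 2)]

A = Float32[1 2; 3 4]
x = Float32[10, 100]
@assert vecmat_ref(*, +, identity, x, A) ≈ vec(x' * A)   # standard x' * A
@assert vecmat_ref(*, +, identity, x, A) ≈ [310f0, 420f0]
```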

Arguments

  • f=*: Binary operation combining x[i] and A[i,j] (applied to A[i,j] alone when x=nothing)
  • op=+: Reduction operator
  • dst: Output vector of length p (number of columns)
  • x: Input vector of length n (number of rows), or nothing for pure column reduction
  • src: Input matrix of size (n, p)

Keyword Arguments

  • g=identity: Optional post-reduction transformation
  • tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
  • Nitem=nothing: Number of items per thread (auto-selected if nothing)
  • Nthreads=nothing: Number of threads per column reduction
  • workgroup=nothing: Workgroup size
  • blocks=nothing: Maximum number of blocks

Examples

A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 1000)

# Standard vector-matrix multiply: y = x' * A
y = vecmat(x, A)

# Column-wise sum: y[j] = sum(A[:, j])
y = vecmat(nothing, A)

# With pre-allocated buffer for repeated calls
tmp = KernelForge.get_allocation(VecMat, *, +, x, A)
dst = CUDA.zeros(Float32, 500)
for i in 1:100
    vecmat!(dst, x, A; tmp)
end

Utilities

KernelForge.get_allocation — Function
get_allocation(::Type{MapReduce1D}, f, op, src_or_srcs, blocks=DEFAULT_BLOCKS)

Allocate a KernelBuffer for mapreduce1d!. Useful for repeated reductions.

Arguments

  • f: Map function (used to infer intermediate eltype)
  • op: Reduction operator
  • src_or_srcs: Input GPU array or NTuple of arrays (used to determine backend and eltype)
  • blocks: Number of blocks (must match blocks used in mapreduce1d!)

Returns

A KernelBuffer with named fields partial and flag (flags are UInt8).

Examples

x = CUDA.rand(Float32, 10_000)
tmp = KernelForge.get_allocation(MapReduce1D, x -> x^2, +, x)
dst = CUDA.zeros(Float32, 1)

for i in 1:100
    mapreduce1d!(x -> x^2, +, dst, x; tmp)
end

get_allocation(::Type{VecMat}, f, op, x, src[, Nblocks]) -> KernelBuffer

Allocate a KernelBuffer for vecmat!. Useful for repeated calls.

Arguments

  • f: Map function (used to infer intermediate eltype)
  • op: Reduction operator
  • x: Input vector or nothing
  • src: Input GPU matrix (used to determine backend and eltype)
  • Nblocks: Number of blocks (auto-computed if omitted)

Returns

A KernelBuffer with named fields partial and flag.

Examples

A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 1000)
tmp = KernelForge.get_allocation(VecMat, *, +, x, A)
dst = CUDA.zeros(Float32, 500)

for i in 1:100
    vecmat!(dst, x, A; tmp)
end

get_allocation(::Type{MatVec}, f, op, src, x[, Nblocks]) -> KernelBuffer

Allocate a KernelBuffer for matvec!. Useful for repeated calls.

Arguments

  • f: Map function (used to infer intermediate eltype)
  • op: Reduction operator
  • src: Input GPU matrix (used to determine backend and eltype)
  • x: Input vector or nothing
  • Nblocks: Number of blocks (auto-computed if omitted)

Returns

A KernelBuffer with named fields partial and flag.

Examples

A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 500)
tmp = KernelForge.get_allocation(MatVec, *, +, A, x)
dst = CUDA.zeros(Float32, 1000)

for i in 1:100
    matvec!(dst, A, x; tmp)
end

get_allocation(::Type{Scan1D}, f, op, src[, blocks])

Allocate a KernelBuffer for scan!. Useful for repeated scans.

Arguments

  • f: Map function (used to infer intermediate eltype)
  • op: Reduction operator
  • src: Input GPU array (used to determine backend and eltype)
  • blocks: Number of blocks (auto-computed using default Nitem and DEFAULT_WORKGROUP if omitted)

Returns

A KernelBuffer with named fields partial1, partial2, and flag (flags are UInt8).

Examples

x = CUDA.rand(Float32, 10_000)
tmp = KernelForge.get_allocation(Scan1D, identity, +, x)

dst = similar(x)
for i in 1:100
    scan!(+, dst, x; tmp)
end

get_allocation(::Type{Argmax1D}, f, src; blocks=DEFAULT_BLOCKS)
get_allocation(::Type{Argmax1D}, f, srcs::NTuple; blocks=DEFAULT_BLOCKS)

Allocate a KernelBuffer for argmax1d!. Useful for repeated reductions.

The intermediate type is Tuple{H, Int} where H = promote_op(f, T), tracking both value and index.

Arguments

  • f: Map function (used to infer intermediate eltype)
  • src or srcs: Input GPU array(s) (used for backend and element type)

Keyword Arguments

  • blocks=DEFAULT_BLOCKS: Number of blocks (must match blocks used in argmax1d!)

Examples

x = CUDA.rand(Float32, 10_000)
tmp = KernelForge.get_allocation(Argmax1D, identity, x)
dst = CUDA.zeros(Int, 1)

for i in 1:100
    argmax1d!(identity, >, dst, x; tmp)
end