API Reference
Copy
KernelForge.vcopy! — Function
vcopy!(dst::AbstractGPUVector, src::AbstractGPUVector; Nitem=4)
Copy src to dst using vectorized GPU memory access.
Performs a high-throughput copy by loading and storing Nitem elements per thread, reducing memory transaction overhead compared to scalar copies.
Arguments
- dst: Destination GPU vector
- src: Source GPU vector (must have same length as dst)
- Nitem=4: Number of elements processed per thread. Higher values improve throughput but require length(src) to be divisible by Nitem.
Example
src = CUDA.rand(Float32, 1024)
dst = CUDA.zeros(Float32, 1024)
vcopy!(dst, src)
See also: KernelForge.setvalue!
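Because Nitem > 1 requires the length to be divisible by Nitem, a cautious caller can pick the largest divisor up front. A minimal sketch (pick_nitem is a hypothetical helper, not part of KernelForge):

```julia
# Pick the largest Nitem in (4, 2, 1) that divides n (illustrative helper).
pick_nitem(n) = n % 4 == 0 ? 4 : (n % 2 == 0 ? 2 : 1)

pick_nitem(1024)  # 4
pick_nitem(1022)  # 2
pick_nitem(1023)  # 1
```

The result can be passed straight through, e.g. vcopy!(dst, src; Nitem = pick_nitem(length(src))).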
KernelForge.setvalue! — Function
setvalue!(dst::AbstractGPUVector{T}, val::T; Nitem=4) where T
Fill dst with val using vectorized GPU memory access.
Performs a high-throughput fill by storing Nitem copies of val per thread, reducing memory transaction overhead compared to scalar writes.
Arguments
- dst: Destination GPU vector
- val: Value to fill (must match element type of dst)
- Nitem=4: Number of elements written per thread. Higher values improve throughput but require length(dst) to be divisible by Nitem.
Example
dst = CUDA.zeros(Float32, 1024)
setvalue!(dst, 1.0f0)
See also: KernelForge.vcopy!
Map-Reduce
KernelForge.mapreduce — Function
mapreduce(f, op, src::AbstractArray; dims=nothing, kwargs...) -> scalar or GPU array
GPU parallel map-reduce operation with optional dimension reduction.
Arguments
- f: Map function applied to each element
- op: Associative binary reduction operator
- src: Input GPU array
Keyword Arguments
- dims=nothing: Dimensions to reduce over. Options:
  - nothing or :: Reduce over all dimensions → scalar
  - Int or Tuple{Int...}: Reduce over those dims → GPU array
- g=identity: Post-reduction transformation
- tmp=nothing: Pre-allocated buffer — a MapReduceBuffer (full reduction), VecMatBuffer (dim=1 reduction), or MatVecBuffer (dim=2 reduction)
- Additional kwargs passed to underlying implementations
Fast paths
- Full reduction (dims=nothing) → mapreduce1d
- All dims explicit → mapreducedims
- Contiguous leading dims (1,...,k) → reshape, mapreduce2d on dim 1, reshape back
- Contiguous trailing dims (k,...,n) → reshape, mapreduce2d on dim 2, reshape back
- Both leading and trailing contiguous blocks → two mapreduce2d passes
- General dims → mapreducedims fallback
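The dispatch above can be sketched on the CPU. This is purely illustrative (classify_dims and the returned symbols are hypothetical names; the real dispatch is internal to KernelForge):

```julia
# Classify a dims argument for an n-dimensional array into the fast paths listed above.
function classify_dims(dims, n)
    dims === nothing && return :mapreduce1d                    # full reduction
    ds = sort!(collect(Tuple(dims)))
    ds == collect(1:n)              && return :mapreducedims   # all dims explicit
    ds == collect(1:length(ds))     && return :leading         # mapreduce2d on dim 1
    ds == collect(n-length(ds)+1:n) && return :trailing        # mapreduce2d on dim 2
    # leading prefix + trailing suffix covering all of ds → two mapreduce2d passes
    lead = 0; while lead < length(ds) && ds[lead+1] == lead + 1; lead += 1; end
    trail = 0; while trail < length(ds) && ds[end-trail] == n - trail; trail += 1; end
    lead + trail == length(ds)      && return :two_pass
    return :mapreducedims_fallback
end

classify_dims(nothing, 3)  # :mapreduce1d
classify_dims(1, 3)        # :leading
classify_dims((2, 3), 3)   # :trailing
classify_dims((1, 3), 3)   # :two_pass
classify_dims(2, 4)        # :mapreducedims_fallback
```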
Examples
A = CUDA.rand(Float32, 100, 50, 20)
# Full reduction → scalar
total = mapreduce(identity, +, A)
# Reduce along dim 1: (100, 50, 20) -> (1, 50, 20)
col_sums = mapreduce(identity, +, A; dims=1)
# Reduce along dims (1,2): (100, 50, 20) -> (1, 1, 20)
plane_sums = mapreduce(identity, +, A; dims=(1,2))
# Reduce along last dim: (100, 50, 20) -> (100, 50, 1)
depth_sums = mapreduce(identity, +, A; dims=3)
See also: KernelForge.mapreduce!, mapreduce1d, mapreduce2d, mapreducedims
mapreduce(f, op, srcs::NTuple; kwargs...)
Multi-array mapreduce. Only supports full reduction (dims=nothing).
Keyword Arguments
- g=identity: Post-reduction transformation
- tmp=nothing: Pre-allocated MapReduceBuffer
- Additional kwargs passed to mapreduce1d
KernelForge.mapreduce! — Function
mapreduce!(f, op, dst, src; dims=nothing, kwargs...)
In-place GPU parallel map-reduce with dimension support.
Arguments
- f: Map function applied to each element
- op: Associative binary reduction operator
- dst: Output array
- src: Input GPU array
Keyword Arguments
- dims=nothing: Dimensions to reduce over. Options:
  - nothing or :: Reduce over all dimensions → writes to dst[1]
  - Int or Tuple{Int...}: Reduce over those dims → writes to dst
- g=identity: Post-reduction transformation
- tmp=nothing: Pre-allocated buffer — a MapReduceBuffer (full reduction), VecMatBuffer (dim=1 reduction), or MatVecBuffer (dim=2 reduction)
- Additional kwargs passed to underlying implementations
Examples
A = CUDA.rand(Float32, 100, 50)
dst = CUDA.zeros(Float32, 1, 50)
mapreduce!(identity, +, dst, A; dims=1)
See also: KernelForge.mapreduce
KernelForge.mapreduce2d — Function
mapreduce2d(f, op, src, dim; kwargs...) -> Vector
GPU parallel reduction along dimension dim.
- dim=1: Column-wise reduction (vertical), output length = number of columns
- dim=2: Row-wise reduction (horizontal), output length = number of rows
Arguments
- f: Element-wise transformation
- op: Reduction operator
- src: Input matrix of size (n, p)
- dim: Dimension to reduce along (1 or 2)
Keyword Arguments
- g=identity: Post-reduction transformation
- tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
For dim=1 (column-wise):
- Nitem=nothing: Items per thread
- Nthreads=nothing: Threads per column reduction
- workgroup=nothing: Workgroup size
- blocks=nothing: Number of blocks
For dim=2 (row-wise):
- chunksz=nothing: Chunk size for row processing
- Nblocks=nothing: Number of blocks per row
- workgroup=nothing: Workgroup size
- blocks_row=nothing: Blocks per row
Examples
A = CUDA.rand(Float32, 1000, 500)
# Column sums (reduce along dim=1)
col_sums = mapreduce2d(identity, +, A, 1)
# Row maximums (reduce along dim=2)
row_maxs = mapreduce2d(identity, max, A, 2)
# Column means
col_means = mapreduce2d(identity, +, A, 1; g=x -> x / size(A, 1))
# Sum of squares per row
row_ss = mapreduce2d(abs2, +, A, 2)
See also: KernelForge.mapreduce2d! for the in-place version.
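On the CPU, the same results can be written with Base.mapreduce. This is a reference for the semantics only, not how the kernels are implemented:

```julia
# Base-only reference: mapreduce2d(f, op, A, dim) matches a dims=dim reduction, flattened.
A = Float32[1 2; 3 4; 5 6]                 # size (3, 2)
col = vec(mapreduce(abs2, +, A; dims=1))   # length 2: per-column sums of squares
row = vec(mapreduce(abs2, +, A; dims=2))   # length 3: per-row sums of squares
col  # Float32[35, 56]   (1+9+25, 4+16+36)
row  # Float32[5, 25, 61]
```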
KernelForge.mapreduce2d! — Function
mapreduce2d!(f, op, dst, src, dim; kwargs...)
In-place GPU parallel reduction along dimension dim.
- dim=1: Column-wise reduction (vertical), dst length = number of columns
- dim=2: Row-wise reduction (horizontal), dst length = number of rows
Arguments
- f: Element-wise transformation
- op: Reduction operator
- dst: Output vector
- src: Input matrix of size (n, p)
- dim: Dimension to reduce along (1 or 2)
Keyword Arguments
- g=identity: Post-reduction transformation
- tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
For dim=1 (column-wise):
- Nitem=nothing: Items per thread
- Nthreads=nothing: Threads per column reduction
- workgroup=nothing: Workgroup size
- blocks=nothing: Number of blocks
For dim=2 (row-wise):
- chunksz=nothing: Chunk size for row processing
- Nblocks=nothing: Number of blocks per row
- workgroup=nothing: Workgroup size
- blocks_row=nothing: Blocks per row
Examples
A = CUDA.rand(Float32, 1000, 500)
col_sums = CUDA.zeros(Float32, 500)
row_maxs = CUDA.zeros(Float32, 1000)
# Column sums
mapreduce2d!(identity, +, col_sums, A, 1)
# Row maximums
mapreduce2d!(identity, max, row_maxs, A, 2)
See also: KernelForge.mapreduce2d for the allocating version.
KernelForge.mapreduce1d — Function
mapreduce1d(f, op, src; kwargs...) -> scalar or GPU array
mapreduce1d(f, op, srcs::NTuple; kwargs...) -> scalar or GPU array
GPU parallel map-reduce operation.
Applies f to each element, reduces with op, and optionally applies g to the final result.
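Ignoring execution details, the result matches this Base-only equivalent (illustrative reference, not the GPU implementation):

```julia
# mapreduce1d(f, op, x; g) computes g(reduce(op, map(f, x))) — e.g. a Euclidean norm:
x = Float32[3, 4]
ref = sqrt(mapreduce(abs2, +, x))   # g = sqrt, f = abs2, op = +  →  5.0f0
```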
Arguments
- f: Map function applied to each element
- op: Associative binary reduction operator
- src or srcs: Input GPU array(s)
Keyword Arguments
- g=identity: Post-reduction transformation applied to final result
- tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
- blocks=100: Number of blocks
- to_cpu=true: If true, return scalar; otherwise return 1-element GPU array
Examples
# Sum of squares
x = CUDA.rand(Float32, 10_000)
result = mapreduce1d(x -> x^2, +, x)
# Return GPU array instead of scalar
result = mapreduce1d(x -> x^2, +, x; to_cpu=false)
# Dot product of two arrays
x, y = CUDA.rand(Float32, 10_000), CUDA.rand(Float32, 10_000)
result = mapreduce1d((a, b) -> a * b, +, (x, y))
See also: KernelForge.mapreduce1d! for the in-place version.
KernelForge.mapreduce1d! — Function
mapreduce1d!(f, op, dst, src; kwargs...)
mapreduce1d!(f, op, dst, srcs::NTuple; kwargs...)
In-place GPU parallel map-reduce, writing result to dst[1].
Arguments
- f: Map function applied to each element
- op: Associative binary reduction operator
- dst: Output array (result written to first element)
- src or srcs: Input GPU array(s)
Keyword Arguments
- g=identity: Post-reduction transformation applied to final result
- tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
- blocks=100: Number of blocks
Examples
x = CUDA.rand(Float32, 10_000)
dst = CUDA.zeros(Float32, 1)
# Sum
mapreduce1d!(identity, +, dst, x)
# With pre-allocated buffer for repeated calls
tmp = KernelForge.get_allocation(MapReduce1D, x -> x^2, +, x)
for i in 1:100
mapreduce1d!(x -> x^2, +, dst, x; tmp)
end
See also: KernelForge.mapreduce1d for the allocating version.
KernelForge.mapreducedims — Function
mapreducedims(f, op, src, dims; kwargs...) -> GPU array
GPU parallel map-reduce over specified dimensions.
Applies f to each element, reduces along dims with op, and optionally applies g to each final element.
Arguments
- f: Map function applied to each element
- op: Associative binary reduction operator
- src: Input GPU array
- dims: Dimension(s) to reduce over (Int or tuple of Ints)
Keyword Arguments
- g=identity: Post-reduction transformation applied to each result element
- workgroup=256: Workgroup size
Examples
# Sum along rows (reduce dim 1)
x = CUDA.rand(Float32, 128, 64)
result = mapreducedims(identity, +, x, 1) # shape: (1, 64)
# Sum of squares along columns (reduce dim 2)
result = mapreducedims(x -> x^2, +, x, 2) # shape: (128, 1)
# Reduce multiple dimensions
x = CUDA.rand(Float32, 4, 8, 16)
result = mapreducedims(identity, +, x, (1, 3)) # shape: (1, 8, 1)
See also: KernelForge.mapreducedims! for the in-place version.
KernelForge.mapreducedims! — Function
mapreducedims!(f, op, dst, src, dims; kwargs...)
In-place GPU parallel map-reduce over specified dimensions, writing result to dst.
Arguments
- f: Map function applied to each element
- op: Associative binary reduction operator
- dst: Output array (must have size 1 along each reduced dimension)
- src: Input GPU array
- dims: Dimension(s) to reduce over (Int or tuple of Ints)
Keyword Arguments
- g=identity: Post-reduction transformation applied to each result element
- workgroup=256: Workgroup size
Examples
x = CUDA.rand(Float32, 128, 64)
dst = CUDA.zeros(Float32, 1, 64)
# Sum along dim 1
mapreducedims!(identity, +, dst, x, 1)
# Sum of squares along dim 2 with pre-allocated dst
dst2 = CUDA.zeros(Float32, 128, 1)
mapreducedims!(x -> x^2, +, dst2, x, 2)
See also: KernelForge.mapreducedims for the allocating version.
Scan
KernelForge.scan — Function
scan(f, op, src; kwargs...) -> GPU array
scan(op, src; kwargs...) -> GPU array
GPU parallel prefix scan (cumulative reduction) using a decoupled lookback algorithm.
Applies f to each element, computes inclusive prefix scan with op, and optionally applies g to each output element.
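The semantics match Base.accumulate on the CPU (a reference only; the GPU version uses decoupled lookback):

```julia
# scan(f, op, x; g) computes map(g, accumulate(op, map(f, x))) — an inclusive scan.
x = Float32[1, 2, 3, 4]
ref = accumulate(+, map(abs2, x))   # Float32[1, 5, 14, 30]
```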
Arguments
- f: Map function applied to each element (defaults to identity)
- op: Associative binary scan operator
- src: Input GPU array
Keyword Arguments
- g=identity: Post-scan transformation applied to each output element
- tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
Examples
# Cumulative sum
x = CUDA.rand(Float32, 10_000)
result = scan(+, x)
# Cumulative sum of squares
result = scan(x -> x^2, +, x)
# With post-scan transformation
result = scan(+, x; g = sqrt)
# With pre-allocated buffer for repeated calls
tmp = KernelForge.get_allocation(Scan1D, identity, +, x)
result = scan(+, x; tmp)
See also: KernelForge.scan! for the in-place version.
KernelForge.scan! — Function
scan!(f, op, dst, src; kwargs...)
scan!(op, dst, src; kwargs...)
In-place GPU parallel prefix scan using a decoupled lookback algorithm.
Applies f to each element, computes inclusive prefix scan with op, and optionally applies g to each output element, writing results to dst.
Arguments
- f: Map function applied to each element (defaults to identity)
- op: Associative binary scan operator
- dst: Output array for scan results
- src: Input GPU array
Keyword Arguments
- g=identity: Post-scan transformation applied to each output element
- tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
Examples
x = CUDA.rand(Float32, 10_000)
dst = similar(x)
# Cumulative sum
scan!(+, dst, x)
# With pre-allocated buffer for repeated calls
tmp = KernelForge.get_allocation(Scan1D, identity, +, x)
for i in 1:100
scan!(+, dst, x; tmp)
end
See also: KernelForge.scan for the allocating version.
Search
KernelForge.findfirst — Function
findfirst(filtr, src; kwargs...) -> Int or CartesianIndex or nothing
GPU parallel findfirst. Returns the index of the first element in src for which filtr returns true, or nothing if no such element exists. For multidimensional arrays, returns a CartesianIndex.
Arguments
- filtr: Predicate function
- src: Input GPU array
Keyword Arguments
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
- blocks=100: Number of blocks
Examples
x = adapt(backend, rand(Float32, 10_000))
findfirst(>(0.99f0), x) # returns a linear index or nothing
A = adapt(backend, rand(Float32, 100, 100))
findfirst(>(0.99f0), A) # returns a CartesianIndex or nothing
See also: KernelForge.findlast.
KernelForge.findlast — Function
findlast(filtr, src; kwargs...) -> Int or CartesianIndex or nothing
GPU parallel findlast. Returns the index of the last element in src for which filtr returns true, or nothing if no such element exists. Implemented by reversing src and delegating to KernelForge.findfirst, so it accepts the same keyword arguments. For multidimensional arrays, returns a CartesianIndex.
Arguments
- filtr: Predicate function
- src: Input GPU array
Keyword Arguments
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
- blocks=100: Number of blocks
Examples
x = adapt(backend, rand(Float32, 10_000))
findlast(>(0.99f0), x) # returns a linear index or nothing
A = adapt(backend, rand(Float32, 100, 100))
findlast(>(0.99f0), A) # returns a CartesianIndex or nothing
See also: KernelForge.findfirst.
KernelForge.argmax1d — Function
argmax1d(f, rel, src; kwargs...) -> Int or GPU array
argmax1d(f, rel, srcs::NTuple; kwargs...) -> Int or GPU array
GPU parallel argmax/argmin operation.
Applies f to each element, finds the extremum according to rel, and returns the index of the first extremal element. Ties are broken by smallest index.
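A CPU sketch of the selection rule, including the tie-break (argmax1d_ref is an illustrative name, not the kernel):

```julia
# First index whose f-value is extremal under the strict relation rel.
function argmax1d_ref(f, rel, xs)
    best, besti = f(xs[1]), 1
    for i in 2:length(xs)
        v = f(xs[i])
        if rel(v, best)        # strict relation ⇒ ties keep the smallest index
            best, besti = v, i
        end
    end
    return besti
end

argmax1d_ref(identity, >, [3, 1, 4, 1, 5])  # 5
argmax1d_ref(abs, <, [-3, 1, -1, 2])        # 2  (|1| is smallest and comes first)
```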
Arguments
- f: Map function applied to each element
- rel: Comparison relation (> for argmax, < for argmin)
- src or srcs: Input GPU array(s)
Keyword Arguments
- tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
- blocks=100: Number of blocks
- to_cpu=true: If true, return scalar Int; otherwise return 1-element GPU array
Examples
x = CUDA.rand(Float32, 10_000)
# Argmax returning scalar index
idx = argmax1d(identity, >, x)
# Argmax returning 1-element GPU array
idx_gpu = argmax1d(identity, >, x; to_cpu=false)
# Argmin of absolute values
idx = argmax1d(abs, <, x)
See also: KernelForge.argmax1d! for the in-place version.
KernelForge.argmax1d! — Function
argmax1d!(f, rel, dst, src; kwargs...)
argmax1d!(f, rel, dst, srcs::NTuple; kwargs...)
In-place GPU parallel argmax/argmin, writing the index to dst[1].
Ties are broken by smallest index.
Arguments
- f: Map function applied to each element
- rel: Comparison relation (> for argmax, < for argmin)
- dst: Output array (index written to first element)
- src or srcs: Input GPU array(s)
Keyword Arguments
- tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
- Nitem=nothing: Items per thread (auto-selected if nothing)
- workgroup=256: Workgroup size
- blocks=100: Number of blocks
Examples
x = CUDA.rand(Float32, 10_000)
dst = CUDA.zeros(Int, 1)
# Argmax index
argmax1d!(identity, >, dst, x)
# With pre-allocated temporary for repeated calls
tmp = KernelForge.get_allocation(Argmax1D, identity, x)
for i in 1:100
argmax1d!(identity, >, dst, x; tmp)
end
See also: KernelForge.argmax1d for the allocating version.
KernelForge.argmax — Function
argmax(rel, src::AbstractArray)
GPU parallel search returning the (value, index) pair of the element that is extremal according to the relation rel. Equivalent to argmax1d(identity, rel, src).
Arguments
- rel: Comparison relation (e.g. > for maximum, < for minimum)
- src: Input GPU array
Examples
x = CuArray([3f0, 1f0, 4f0, 1f0, 5f0])
argmax(>, x) # returns (5f0, 5)
argmax(<, x) # returns (1f0, 2)
See also: KernelForge.argmax1d, KernelForge.argmin.
argmax(src::AbstractArray)
GPU parallel argmax returning the (value, index) pair of the maximum element. Equivalent to argmax(>, src).
Examples
x = CuArray([3f0, 1f0, 4f0, 1f0, 5f0])
argmax(x) # returns (5f0, 5)
See also: KernelForge.argmin, KernelForge.argmax1d.
KernelForge.argmin — Function
argmin(src::AbstractArray)
GPU parallel argmin returning the (value, index) pair of the minimum element. Equivalent to argmax(<, src).
Examples
x = CuArray([3f0, 1f0, 4f0, 1f0, 5f0])
argmin(x) # returns (1f0, 2)
See also: KernelForge.argmax, KernelForge.argmax1d.
Matrix-Vector
KernelForge.matvec — Function
matvec([f, op,] src::AbstractMatrix, x; kwargs...) -> GPU array
matvec!([f, op,] dst, src, x; kwargs...)
Generalized matrix-vector operation with customizable element-wise and reduction operations.
Computes dst[i] = g(op_j(f(src[i,j], x[j]))) for each row i, where op_j denotes reduction over columns. For standard matrix-vector multiplication, this is dst[i] = sum_j(src[i,j] * x[j]).
The allocating version matvec returns a newly allocated result vector. The in-place version matvec! writes to dst.
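The formula above corresponds to this Base-only reference (semantics only; matvec_ref is an illustrative name):

```julia
# matvec semantics, row-wise: dst[i] = g(reduce(op, (f(A[i,j], x[j]) for j))).
matvec_ref(f, op, g, A, x) =
    [g(reduce(op, (f(A[i, j], x[j]) for j in axes(A, 2)))) for i in axes(A, 1)]

A = [1.0 2.0; 3.0 4.0]
x = [10.0, 100.0]
matvec_ref(*, +, identity, A, x)   # == A * x == [210.0, 430.0]
```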
Arguments
- f: Binary operation applied element-wise (default: *)
- op: Reduction operation across columns (default: +)
- dst: Output vector (in-place versions only)
- src: Input matrix
- x: Input vector, or nothing for row-wise reduction of src alone
Keyword Arguments
- g=identity: Unary transformation applied to each reduced row
- tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
- chunksz=nothing: Elements per thread (auto-tuned if nothing)
- Nblocks=nothing: Number of thread blocks (auto-tuned if nothing)
- workgroup=nothing: Threads per block (auto-tuned if nothing)
- blocks_row=nothing: Number of blocks used to process a single row; relevant only for wide matrices (many columns, few rows) where parallelizing across columns is beneficial. Auto-tuned if nothing.
Examples
A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 500)
# Standard matrix-vector multiply: y = A * x
y = matvec(A, x)
# Row-wise sum: y[i] = sum(A[i, :])
y = matvec(A, nothing)
# Row-wise maximum: y[i] = max_j(A[i, j])
y = matvec(identity, max, A, nothing)
# Softmax numerator: y[i] = sum_j(exp(A[i,j] - x[j]))
y = matvec((a, b) -> exp(a - b), +, A, x)
# In-place version
dst = CUDA.zeros(Float32, 1000)
matvec!(dst, A, x)
# With pre-allocated buffer for repeated calls
tmp = KernelForge.get_allocation(MatVec, *, +, A, x)
for i in 1:100
matvec!(dst, A, x; tmp)
end
KernelForge.matvec! — Function
matvec! shares its docstring with KernelForge.matvec; see the entry above for the signature, arguments, keyword arguments, and examples.
KernelForge.vecmat! — Function
vecmat([f, op,] x, src; kwargs...) -> GPU array
vecmat!([f, op,] dst, x, src; kwargs...)
GPU parallel vector-matrix operation: dst[j] = g(op_i(f(x[i], A[i,j]))), reducing over rows i.
For the standard vector-matrix product, vecmat!(dst, x, A) computes dst[j] = sum_i(x[i] * A[i,j]). When x = nothing, it computes column reductions: dst[j] = sum_i(A[i,j]).
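A Base-only reference for these semantics (vecmat_ref is an illustrative name; it assumes f receives (x[i], A[i,j]) in that order, so check the KernelForge source if the argument order matters for your f):

```julia
# vecmat semantics, column-wise: dst[j] reduces f(x[i], A[i,j]) over rows i.
vecmat_ref(f, op, g, x, A) =
    [g(reduce(op, (f(x[i], A[i, j]) for i in axes(A, 1)))) for j in axes(A, 2)]

A = [1.0 2.0; 3.0 4.0]
x = [10.0, 20.0]
vecmat_ref(*, +, identity, x, A)   # == vec(x' * A) == [70.0, 100.0]
```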
Arguments
- f=*: Element-wise operation combining x[i] and A[i,j] (applied to A[i,j] alone if x = nothing)
- op=+: Reduction operator
- dst: Output vector of length p (number of columns)
- x: Input vector of length n (number of rows), or nothing for pure column reduction
- src: Input matrix of size (n, p)
Keyword Arguments
- g=identity: Optional post-reduction transformation
- tmp=nothing: Pre-allocated KernelBuffer (or nothing to allocate automatically)
- Nitem=nothing: Number of items per thread (auto-selected if nothing)
- Nthreads=nothing: Number of threads per column reduction
- workgroup=nothing: Workgroup size
- blocks=nothing: Maximum number of blocks
Examples
A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 1000)
# Standard vector-matrix multiply: y = x' * A
y = vecmat(x, A)
# Column-wise sum: y[j] = sum(A[:, j])
y = vecmat(nothing, A)
# With pre-allocated buffer for repeated calls
tmp = KernelForge.get_allocation(VecMat, *, +, x, A)
dst = CUDA.zeros(Float32, 500)
for i in 1:100
vecmat!(dst, x, A; tmp)
end
Utilities
KernelForge.get_allocation — Function
get_allocation(::Type{MapReduce1D}, f, op, src_or_srcs, blocks=DEFAULT_BLOCKS)
Allocate a KernelBuffer for mapreduce1d!. Useful for repeated reductions.
Arguments
- f: Map function (used to infer intermediate eltype)
- op: Reduction operator
- src_or_srcs: Input GPU array or NTuple of arrays (used to determine backend and eltype)
- blocks: Number of blocks (must match blocks used in mapreduce1d!)
Returns
A KernelBuffer with named fields partial and flag (flags are UInt8).
Examples
x = CUDA.rand(Float32, 10_000)
tmp = KernelForge.get_allocation(MapReduce1D, x -> x^2, +, x)
dst = CUDA.zeros(Float32, 1)
for i in 1:100
mapreduce1d!(x -> x^2, +, dst, x; tmp)
end
get_allocation(::Type{VecMat}, f, op, x, src[, Nblocks]) -> KernelBuffer
Allocate a KernelBuffer for vecmat!. Useful for repeated calls.
Arguments
- f: Map function (used to infer intermediate eltype)
- op: Reduction operator
- x: Input vector or nothing
- src: Input GPU matrix (used to determine backend and eltype)
- Nblocks: Number of blocks (auto-computed if omitted)
Returns
A KernelBuffer with named fields partial and flag.
Examples
A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 1000)
tmp = KernelForge.get_allocation(VecMat, *, +, x, A)
dst = CUDA.zeros(Float32, 500)
for i in 1:100
vecmat!(dst, x, A; tmp)
end
get_allocation(::Type{MatVec}, f, op, src, x[, Nblocks]) -> KernelBuffer
Allocate a KernelBuffer for matvec!. Useful for repeated calls.
Arguments
- f: Map function (used to infer intermediate eltype)
- op: Reduction operator
- src: Input GPU matrix (used to determine backend and eltype)
- x: Input vector or nothing
- Nblocks: Number of blocks (auto-computed if omitted)
Returns
A KernelBuffer with named fields partial and flag.
Examples
A = CUDA.rand(Float32, 1000, 500)
x = CUDA.rand(Float32, 500)
tmp = KernelForge.get_allocation(MatVec, *, +, A, x)
dst = CUDA.zeros(Float32, 1000)
for i in 1:100
matvec!(dst, A, x; tmp)
end
get_allocation(::Type{Scan1D}, f, op, src[, blocks])
Allocate a KernelBuffer for scan!. Useful for repeated scans.
Arguments
- f: Map function (used to infer intermediate eltype)
- op: Reduction operator
- src: Input GPU array (used to determine backend and eltype)
- blocks: Number of blocks (auto-computed using default Nitem and DEFAULT_WORKGROUP if omitted)
Returns
A KernelBuffer with named fields partial1, partial2, and flag (flags are UInt8).
Examples
x = CUDA.rand(Float32, 10_000)
tmp = KernelForge.get_allocation(Scan1D, identity, +, x)
dst = similar(x)
for i in 1:100
scan!(+, dst, x; tmp)
end
get_allocation(::Type{Argmax1D}, f, src; blocks=DEFAULT_BLOCKS)
get_allocation(::Type{Argmax1D}, f, srcs::NTuple; blocks=DEFAULT_BLOCKS)
Allocate a KernelBuffer for argmax1d!. Useful for repeated reductions.
The intermediate type is Tuple{H, Int} where H = promote_op(f, T), tracking both value and index.
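This intermediate type can be checked directly with Base.promote_op (illustrative):

```julia
# The buffer's intermediate eltype pairs the mapped value with its Int index.
H = Base.promote_op(abs2, Float32)   # Float32
T = Tuple{H, Int}                    # value-index pairs tracked during the reduction
```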
Arguments
- f: Map function (used to infer intermediate eltype)
- src or srcs: Input GPU array(s) (used for backend and element type)
Keyword Arguments
- blocks=DEFAULT_BLOCKS: Number of blocks (must match blocks used in argmax1d!)
Examples
x = CUDA.rand(Float32, 10_000)
tmp = KernelForge.get_allocation(Argmax1D, identity, x)
dst = CUDA.zeros(Int, 1)
for i in 1:100
argmax1d!(identity, >, dst, x; tmp)
end