KernelIntrinsics.jl
Despite its name, this package is not affiliated with or part of KernelAbstractions.jl. The name collision is a known issue. A future goal is to upstream its functionality into KernelAbstractions.jl and UnsafeAtomics.jl.
This package provides low-level GPU primitives intended for library developers, not end users. If you're looking for high-level GPU programming in Julia, use CUDA.jl or KernelAbstractions.jl directly. KernelIntrinsics.jl is similar in scope to GPUArraysCore.jl — a building block for other packages.
- No bounds checking for vload/vstore! on dense arrays; out-of-bounds access will cause undefined behavior
- CUDA, ROCm, and Metal backends supported; oneAPI planned
Features
- Memory Fences: Explicit memory synchronization with the @fence macro
- Ordered Memory Access: Acquire/release semantics with the @access macro
- Warp Operations: Efficient intra-warp communication and reduction primitives
  - @shfl: Warp shuffle operations (Up, Down, Xor, Idx modes)
  - @warpreduce: Inclusive scan within a warp
  - @warpfold: Warp-wide reduction to a single value
- Vectorized Memory Operations: Hardware-accelerated vector loads and stores with vload and vstore!. Two indexing modes are available: rebased (default), where idx selects a contiguous block of Nitem elements ((idx-1)*Nitem+1 … idx*Nitem), and direct, where idx is the literal starting index. Array views (SubArray) are supported and fall back to scalar tuple operations automatically.
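To make the difference between an inclusive scan and a warp-wide fold concrete, here is a CPU-only emulation in plain Julia. It is a sketch under assumptions: shfl_up, shfl_xor, warp_inclusive_scan, and warp_fold are hypothetical names invented for illustration, and the shuffle-based algorithms shown (a Hillis-Steele scan and a butterfly reduction) are common textbook patterns, not necessarily what @warpreduce and @warpfold do internally.

```julia
# CPU emulation of one warp: lane values live in a plain Vector.
# All names below are hypothetical stand-ins, not the KernelIntrinsics API.
const WARP = 32

# "Up" shuffle: each lane reads the value held `delta` lanes below it;
# lanes without such a neighbor keep their own value.
shfl_up(vals, delta) =
    [lane > delta ? vals[lane - delta] : vals[lane] for lane in eachindex(vals)]

# "Xor" shuffle: each lane reads from the lane whose 0-based index differs by `mask`.
shfl_xor(vals, mask) = [vals[xor(lane - 1, mask) + 1] for lane in eachindex(vals)]

# Inclusive scan (Hillis-Steele): log2(WARP) shuffle steps.
function warp_inclusive_scan(op, vals)
    delta = 1
    while delta < length(vals)
        shifted = shfl_up(vals, delta)
        vals = [lane > delta ? op(shifted[lane], vals[lane]) : vals[lane]
                for lane in eachindex(vals)]
        delta *= 2
    end
    return vals
end

# Butterfly fold: after log2(WARP) xor steps every lane holds the full reduction.
function warp_fold(op, vals)
    mask = length(vals) >> 1
    while mask >= 1
        vals = op.(vals, shfl_xor(vals, mask))
        mask >>= 1
    end
    return vals
end

lanes = collect(1:WARP)
scan = warp_inclusive_scan(+, lanes)   # lane i holds the sum of lanes 1..i
fold = warp_fold(+, lanes)             # every lane holds the sum of all lanes
```

The scan leaves a different partial result in each lane, while the fold leaves the same total everywhere. The emulation assumes a power-of-two warp size and a commutative, associative operator.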
Cross-Architecture Support
KernelIntrinsics.jl currently supports three backends:
- CUDA via CUDABackend: PTX instructions for fences (fence.acq_rel.gpu), ordered access (ld.acquire.gpu, st.release.gpu), warp shuffles, and vectorized loads/stores.
- ROCm via ROCBackend: equivalent primitives using AMDGPU-specific intrinsics.
- Metal via MetalBackend: Apple GPU support contributed by @WilliBee (PR #5). Maps to AIR intrinsics (air.atomic.fence, air.simd_*); subject to Metal's narrower atomic type set (Int32/UInt32/Float32).
The macro-based API is designed with portability in mind. Future releases may extend support to additional GPU backends (Intel oneAPI). Contributions are welcome.
Installation
using Pkg
Pkg.add("KernelIntrinsics")
Quick Start
using KernelIntrinsics
using KernelAbstractions, CUDA
@kernel function example_kernel(X, Flag)
    X[1] = 10
    @fence               # Ensure X[1] = 10 is visible to all threads
    @access Flag[1] = 1  # Release store
end
X = cu([1])
Flag = cu([0])
example_kernel(CUDABackend())(X, Flag; ndrange=1)
Memory Ordering Semantics
- @fence: Full acquire-release fence across all device threads
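- @access Release: Ensures all writes issued before the store are visible to any thread that later observes the stored value with an acquire load
- @access Acquire: Ensures reads issued after the load observe the writes that preceded the matching release store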
- @access Acquire Device: Explicitly specifies device scope (default)
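The release/acquire pairing described above can be tried on the CPU with Base Julia's per-field atomics (available since Julia 1.7). This is only an analogue of the pattern, not the KernelIntrinsics API: the Mailbox type and the producer!/consumer functions are hypothetical names introduced for illustration.

```julia
# CPU analogue of the release/acquire handshake using Base Julia atomics.
# Mailbox, producer!, and consumer are hypothetical names for this sketch.
mutable struct Mailbox
    payload::Int        # plain (non-atomic) data
    @atomic flag::Int   # atomic flag guarding the payload
end

function producer!(box::Mailbox)
    box.payload = 10                # plain write
    @atomic :release box.flag = 1   # release store: publishes the payload write
    return nothing
end

function consumer(box::Mailbox)
    # Acquire load: once flag == 1 is observed, the payload write is visible too.
    while (@atomic :acquire box.flag) == 0
        yield()                     # let other tasks run while spinning
    end
    return box.payload
end

box = Mailbox(0, 0)
task = Threads.@spawn producer!(box)
received = consumer(box)
wait(task)
```

The payload itself is never accessed atomically; the ordering on the flag is what makes reading it safe, which is the same reason the Quick Start kernel writes X[1] before the release store to Flag[1].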
Performance Considerations
- Warp operations are most efficient when all threads in a warp participate
- Vectorized operations can significantly improve memory bandwidth utilization
- Use the minimum required memory ordering (acquire/release over fences when possible)
- Default warp size is 32 for CUDA and 64 for ROCm; operations assume full warp/wavefront participation
- When vload/vstore! cannot prove pointer alignment at compile time, they emit a small runtime alignment dispatch. Its cost is negligible compared to memory latency, and providing the Alignment argument removes it
- vload/vstore! on array views fall back to scalar tuple operations; performance-critical code should prefer contiguous arrays for vectorized instructions
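As a rough illustration of what a runtime alignment check involves, the following plain-Julia sketch tests whether a block of n elements starts at an address that is a multiple of the vector width in bytes. Both is_vector_aligned and load_block are hypothetical helpers, and the two-way :vector/:scalar split is an assumption made for illustration; the package's internal dispatch may be organized differently.

```julia
# Hypothetical helpers illustrating a runtime alignment check on the CPU;
# they are not part of KernelIntrinsics.jl.

# A block of n elements of type T is naturally aligned when its byte address
# is a multiple of n * sizeof(T).
function is_vector_aligned(A::Array{T}, start::Integer, n::Integer) where {T}
    addr = UInt(pointer(A, start))
    return addr % UInt(n * sizeof(T)) == 0
end

# Pick a path at runtime. Both paths return the same values; only the tag
# differs, standing in for "one wide load" versus "fallback".
function load_block(A::Array{T}, start::Integer, n::Integer) where {T}
    vals = ntuple(k -> A[start + k - 1], n)
    path = is_vector_aligned(A, start, n) ? :vector : :scalar
    return path, vals
end

A = zeros(Float32, 64)
paths = [load_block(A, s, 4)[1] for s in 1:8]
```

For Float32 and n = 4, one of every four consecutive start positions is 16-byte aligned, so the pattern of paths repeats with period 4.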
Implementation Notes
Memory Ordering and Scopes
The implementation of memory fences, orderings, and scopes is inspired by UnsafeAtomics.jl. This package includes tests demonstrating that these primitives work correctly on CUDA and ROCm, generating the expected backend-specific instructions.
Warp Shuffle Operations
The warp shuffle implementation builds upon CUDA.jl's approach but generalizes it to support any concrete bitstype struct, including nested and composite types, as well as NTuples. The backend only needs to implement 32-bit shuffle operations; larger types are automatically decomposed. On ROCm, wavefront-level shuffle operations are used as the equivalent primitive.
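The decomposition idea can be sketched on the CPU: split each value into 32-bit words, move every word slot with a 32-bit-only shuffle, then reassemble. Everything named here (Particle, to_words, from_words, shfl32, shfl_any) is hypothetical and chosen for illustration. The sketch also assumes a padding-free type whose size is a multiple of 4 bytes, which the package itself may not require.

```julia
# Plain-Julia emulation of shuffling a composite type via 32-bit words.
# All names are hypothetical; this is not the package's implementation.
struct Particle                 # concrete bitstype struct, 16 bytes, no padding
    pos::NTuple{2,Float32}
    id::Int64
end

# Split a value into its 32-bit words, and rebuild a value from them.
to_words(x) = collect(reinterpret(UInt32, [x]))
from_words(::Type{T}, words::Vector{UInt32}) where {T} = only(reinterpret(T, words))

# The only "backend" primitive: lane i reads the word held by lane src[i].
shfl32(words::Vector{UInt32}, src) = words[src]

function shfl_any(vals::Vector{T}, src) where {T}
    @assert isbitstype(T) && sizeof(T) % 4 == 0
    nwords = div(sizeof(T), 4)
    per_lane = to_words.(vals)   # word list for each lane
    # One 32-bit shuffle per word slot.
    moved = [shfl32(UInt32[w[k] for w in per_lane], src) for k in 1:nwords]
    # Reassemble each lane's value from its shuffled words.
    return T[from_words(T, UInt32[moved[k][lane] for k in 1:nwords])
             for lane in eachindex(vals)]
end

warp = [Particle((Float32(i), Float32(2i)), i) for i in 1:32]
src = [mod1(i + 1, 32) for i in 1:32]   # each lane reads from its right neighbor
out = shfl_any(warp, src)
```

A 16-byte Particle costs four 32-bit shuffles here, which shows why the backend only has to provide the 32-bit primitive.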
Vectorized Memory Access
Vectorized loads and stores (vload, vstore!) use LLVM intrinsic functions to generate efficient vector instructions (ld.global.v4, st.global.v4 on CUDA, and their ROCm/Metal equivalents) on contiguous arrays. Two indexing modes are provided:
- Rebased (Val(true), default): idx is a block index; vload(A, i, Val(N)) loads A[(i-1)*N+1 : i*N]. This is the form to reach for when you tile data into fixed-size blocks per thread; when the array's base pointer is N-aligned, it lowers to a single aligned vector load.
- Direct (Val(false)): idx is the literal starting index; vload(A, i, Val(N), Val(false)) loads A[i : i+N-1]. Use this when the start position is data-dependent and not a multiple of N.
Both modes handle unknown alignment at runtime via an internal pattern-based dispatch, so correctness is preserved without the user managing alignment manually. Array views (SubArray) are fully supported via an automatic fallback to scalar tuple operations, ensuring correctness at the cost of vectorization.
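The two index mappings can be checked with ordinary Julia ranges. The sketch below mirrors only the index arithmetic stated above. block_range is a hypothetical helper, and its rebase keyword stands in for the Val(true)/Val(false) argument; it does not call vload.

```julia
# Reference index arithmetic for the two modes, in plain Julia.
# block_range is a hypothetical helper, not a package function.
function block_range(idx::Integer, n::Integer; rebase::Bool=true)
    if rebase
        return ((idx - 1) * n + 1):(idx * n)   # idx is a block index
    else
        return idx:(idx + n - 1)               # idx is the literal start
    end
end

A = collect(1:16)
rebased = A[block_range(3, 4)]                 # third block of four elements
direct  = A[block_range(3, 4; rebase=false)]   # four elements starting at index 3
```

A rebased access with block index i covers the same elements as a direct access starting at (i-1)*N+1, so the rebased form is a convenience for tiled layouts.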
Current limitations:
- No bounds checking for vload/vstore! on dense arrays; out-of-bounds access will cause undefined behavior
- Future versions may add vectorized support for non-contiguous views
Requirements
- Julia 1.10+
- KernelAbstractions.jl
- CUDA.jl (for CUDA backend)
- AMDGPU.jl (for ROCm backend)
- Metal.jl (for Metal backend)
License
MIT License