KernelIntrinsics.jl
Despite its name, this package is not affiliated with or part of KernelAbstractions.jl. The name collision is a known issue. Currently, KernelIntrinsics.jl is primarily used as a building block for KernelForge.jl. A future goal is to upstream its functionality into KernelAbstractions.jl and UnsafeAtomics.jl.
This package provides low-level GPU primitives intended for library developers, not end users. If you're looking for high-level GPU programming in Julia, use CUDA.jl or KernelAbstractions.jl directly. KernelIntrinsics.jl is similar in scope to GPUArraysCore.jl — a building block for other packages.
- No bounds checking for vload/vstore! on dense arrays; out-of-bounds access will cause undefined behavior
- CUDA and ROCm backends supported; other backends planned
Features
- Memory Fences: Explicit memory synchronization with the @fence macro
- Ordered Memory Access: Acquire/release semantics with the @access macro
- Warp Operations: Efficient intra-warp communication and reduction primitives
  - @shfl: Warp shuffle operations (Up, Down, Xor, Idx modes)
  - @warpreduce: Inclusive scan within a warp
  - @warpfold: Warp-wide reduction to a single value
- Vectorized Memory Operations: Hardware-accelerated vector loads and stores with vload, vstore!, vload_multi, and vstore_multi!. Array views (SubArray) are supported and fall back to scalar tuple operations automatically.
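To make the difference between an inclusive scan and a warp-wide fold concrete, here is a minimal CPU-only sketch that simulates a 32-lane warp with plain Julia arrays. It does not call KernelIntrinsics.jl: shfl_up, shfl_xor, sim_inclusive_scan, and sim_fold are hypothetical stand-ins written for this illustration, and the package's actual shuffle patterns may differ. The fold assumes an associative and commutative operator.

```julia
const WARP = 32  # CUDA warp size; ROCm wavefronts use 64

# CPU stand-ins for warp shuffles (hypothetical helpers, not the package API).
# Each returns the value every lane would receive from its source lane.
shfl_up(vals, delta) = [i > delta ? vals[i - delta] : vals[i] for i in eachindex(vals)]
shfl_xor(vals, mask) = [vals[((i - 1) ⊻ mask) + 1] for i in eachindex(vals)]

# Inclusive scan: after log2(WARP) steps, lane i holds op over lanes 1..i.
function sim_inclusive_scan(op, vals)
    acc = copy(vals)
    delta = 1
    while delta < length(acc)
        shifted = shfl_up(acc, delta)
        acc = [i > delta ? op(shifted[i], acc[i]) : acc[i] for i in eachindex(acc)]
        delta <<= 1
    end
    return acc
end

# Butterfly fold: after log2(WARP) steps, every lane holds the full reduction.
function sim_fold(op, vals)
    acc = copy(vals)
    mask = length(acc) >> 1
    while mask > 0
        acc = map(op, acc, shfl_xor(acc, mask))
        mask >>= 1
    end
    return acc
end

lanes = collect(1:WARP)
scan = sim_inclusive_scan(+, lanes)  # running sums, one per lane
fold = sim_fold(+, lanes)            # the same total in every lane
```

Both simulations take log2(32) = 5 shuffle steps, which is why these operations are cheap when every lane of the warp participates.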
Cross-Architecture Support
KernelIntrinsics.jl currently supports CUDA GPUs via the CUDABackend and AMD GPUs via the ROCBackend. The CUDA backend leverages PTX instructions for memory fences (fence.acq_rel.gpu), ordered memory access (ld.acquire.gpu, st.release.gpu), warp shuffle operations, and vectorized memory transactions. The ROCm backend provides equivalent primitives using AMDGPU-specific intrinsics.
The macro-based API is designed with portability in mind. Future releases may extend support to additional GPU backends (Intel oneAPI, Apple Metal). Contributions to enable further cross-platform support are welcome.
Installation
using Pkg
Pkg.add("KernelIntrinsics")
Quick Start
using KernelIntrinsics
using KernelAbstractions, CUDA
@kernel function example_kernel(X, Flag)
X[1] = 10
@fence # Ensure X[1]=10 is visible to all threads
@access Flag[1] = 1 # Release store
end
X = cu([1])
Flag = cu([0])
example_kernel(CUDABackend())(X, Flag; ndrange=1)
Memory Ordering Semantics
- @fence: Full acquire-release fence across all device threads
- @access Release: Ensures prior writes are visible before the store
- @access Acquire: Ensures reads issued after the load observe writes made before the matching release store
- @access Acquire Device: Explicitly specifies device scope (default)
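The release/acquire pairing above is the same message-passing pattern used on CPUs. As a rough analogy only, the sketch below reproduces it with Julia's built-in @atomic orderings on a mutable struct. It does not use KernelIntrinsics.jl or a GPU; Mailbox, producer!, and consumer are hypothetical names introduced for this illustration.

```julia
# CPU analogy of the release-store / acquire-load pattern (not the package API).
mutable struct Mailbox
    data::Int          # ordinary field, written before the flag is published
    @atomic flag::Int  # atomic field used to publish the data
end

function producer!(m::Mailbox)
    m.data = 10                   # plain write
    @atomic :release m.flag = 1   # release store: prior writes become visible first
    return nothing
end

function consumer(m::Mailbox)
    # Acquire load: once flag == 1 is observed, the write to data is visible too.
    while (@atomic :acquire m.flag) == 0
        yield()  # let other tasks run while waiting
    end
    return m.data
end

m = Mailbox(0, 0)
t = Threads.@spawn consumer(m)
producer!(m)
result = fetch(t)
```

A paired release store and acquire load order only the accesses around that one flag, which is why they are usually cheaper than a full fence.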
Performance Considerations
- Warp operations are most efficient when all threads in a warp participate
- Vectorized operations can significantly improve memory bandwidth utilization
- Use the minimum required memory ordering (acquire/release over fences when possible)
- Default warp size is 32 for CUDA and 64 for ROCm; operations assume full warp/wavefront participation
- vload_multi/vstore_multi! have a small runtime overhead for the alignment switch, but this is typically negligible compared to memory latency
- vload/vstore! on array views fall back to scalar tuple operations; performance-critical code should prefer contiguous arrays for vectorized instructions
Implementation Notes
Memory Ordering and Scopes
The implementation of memory fences, orderings, and scopes is inspired by UnsafeAtomics.jl. This package includes tests demonstrating that these primitives work correctly on CUDA and ROCm, generating the expected backend-specific instructions.
Warp Shuffle Operations
The warp shuffle implementation builds upon CUDA.jl's approach but generalizes it to support any concrete bitstype struct, including nested and composite types, as well as NTuples. The backend only needs to implement 32-bit shuffle operations; larger types are automatically decomposed. On ROCm, wavefront-level shuffle operations are used as the equivalent primitive.
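The decomposition idea can be sketched on the CPU: split a bitstype value into 32-bit words, shuffle each word independently, and reassemble the struct on the receiving lane. This is an illustration under stated assumptions, not the package's implementation. Particle, to_words, from_words, shfl_down32, and shfl_down_struct are hypothetical names, and the sketch assumes sizeof(T) is a multiple of 4 bytes.

```julia
# Hypothetical example type: 16 bytes, so it splits into four 32-bit words.
struct Particle
    pos::NTuple{2,Float32}
    id::Int64
end

# Split a bitstype value into a tuple of UInt32 words.
function to_words(x::T) where {T}
    @assert isbitstype(T) && sizeof(T) % 4 == 0
    r = Ref(x)
    GC.@preserve r begin
        p = convert(Ptr{UInt32}, Base.unsafe_convert(Ptr{T}, r))
        return ntuple(i -> unsafe_load(p, i), sizeof(T) ÷ 4)
    end
end

# Rebuild a value of type T from its UInt32 words.
function from_words(::Type{T}, words::NTuple{N,UInt32}) where {T,N}
    @assert isbitstype(T) && sizeof(T) == 4N
    r = Ref(words)
    GC.@preserve r begin
        p = convert(Ptr{T}, Base.unsafe_convert(Ptr{NTuple{N,UInt32}}, r))
        return unsafe_load(p)
    end
end

# CPU stand-in for a 32-bit shuffle-down: lane i reads lane i + delta,
# and lanes without a source keep their own value.
shfl_down32(words::Vector{UInt32}, delta) =
    [i + delta <= length(words) ? words[i + delta] : words[i] for i in eachindex(words)]

# Shuffle a whole struct by shuffling each of its words separately.
function shfl_down_struct(vals::Vector{T}, delta) where {T}
    nwords = sizeof(T) ÷ 4
    parts = [to_words(v) for v in vals]
    moved = [shfl_down32([p[w] for p in parts], delta) for w in 1:nwords]
    return [from_words(T, ntuple(w -> moved[w][lane], nwords)) for lane in eachindex(vals)]
end

warp = [Particle((Float32(i), 0f0), i) for i in 1:32]
shifted = shfl_down_struct(warp, 1)
```

Because every word travels through the same 32-bit primitive, a backend only has to provide that one operation for arbitrary bitstype structs to work.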
Vectorized Memory Access
Vectorized loads and stores (vload, vstore!, vload_multi, vstore_multi!) use LLVM intrinsic functions to generate efficient vector instructions (ld.global.v4, st.global.v4 on CUDA, and their ROCm equivalents) on contiguous arrays. Array views (SubArray) are fully supported via an automatic fallback to scalar tuple operations, ensuring correctness at the cost of vectorization.
Current limitations:
- No bounds checking for vload/vstore! on dense arrays; out-of-bounds access will cause undefined behavior
- Future versions may add vectorized support for non-contiguous views
Requirements
- Julia 1.10+
- KernelAbstractions.jl
- CUDA.jl (for CUDA backend)
- AMDGPU.jl (for ROCm backend)
License
MIT License