Software
While my primary research is in statistics, I build high-performance computing tools in Julia, targeting both GPU and CPU. Much of this work centers on parallel primitives — reductions, scans and mapreduce — that underpin everything from dot products to matrix–vector multiplication over arbitrary vector spaces.
KernelForge.jl
Parallel primitives (mapreduce, accumulate) aiming to match vendor libraries like NVIDIA's CUB/cuBLAS while staying cross-platform.
Julia
GPU
Active
18 stars
Docs
GitHub
mapreduce
A small side experiment: a tree-based mapreduce for the CPU that edges past Julia Base while keeping good floating-point precision.
Julia
CPU
Performance notebook
KernelForge.jl
KernelForge.jl aims to provide efficient, portable implementations of parallel primitives that match the performance of vendor-optimized libraries such as NVIDIA’s CUB or cuBLAS while staying cross-platform.
This is made possible by Julia’s design. Users can define their own element types and overload operators such as + and * for them, enabling reductions and matrix–vector multiplication over arbitrary vector spaces with concrete element types (fixed-size types like QuaternionF32 or NTuple{2, Float64}). On top of that, KernelAbstractions.jl lets a single @kernel compile to different low-level targets (PTX for NVIDIA, SPIR-V for Intel, or GCN for AMD) at minimal development cost.
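To make the point about custom operators concrete, here is a minimal CPU-only sketch in plain Julia. It does not use KernelForge.jl or KernelAbstractions.jl; the type `Vec2` and the helper `matvec` are hypothetical names introduced only for illustration. It shows that once + and * are defined for a fixed-size element type, generic reduction code works without modification.

```julia
# Hypothetical fixed-size element type: a 2-D vector over Float32.
struct Vec2
    x::Float32
    y::Float32
end

# Overload + (vector addition) and * (scalar times vector) for Vec2.
Base.:+(a::Vec2, b::Vec2) = Vec2(a.x + b.x, a.y + b.y)
Base.:*(s::Float32, v::Vec2) = Vec2(s * v.x, s * v.y)

# Generic matrix-vector product written only in terms of + and *.
# Hypothetical helper: it works for any element type defining both operations.
function matvec(A::AbstractMatrix, v::AbstractVector)
    size(A, 2) == length(v) || throw(DimensionMismatch("A and v do not conform"))
    return [mapreduce(j -> A[i, j] * v[j], +, axes(A, 2)) for i in axes(A, 1)]
end

A = Float32[1 2; 3 4]
v = [Vec2(1, 0), Vec2(0, 1)]

total = reduce(+, v)   # sum of the two Vec2 values
w = matvec(A, v)       # each row of A forms a linear combination of the Vec2 values

# The same idea with a built-in fixed-size type and an ad hoc operator.
pairs = [(1.0, 2.0), (3.0, 4.0)]
pair_sum = reduce((a, b) -> a .+ b, pairs)
```

Because `Vec2` has only plain-data fields, it is an isbits type, which is the kind of concrete element type that can be stored directly in GPU arrays.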
Live benchmarks
Benchmarks compare KernelForge against vendor libraries (CUB / rocPRIM), AcceleratedKernels.jl (AK) and CUDA.jl / Base, with results pulled live from the benchmark repo. They can be filtered by GPU, operation, element type and metric.
KernelIntrinsics.jl
KernelIntrinsics.jl is the lower-level package KernelForge builds on, giving fine-grained control over memory ordering, synchronization and vectorized access from within Julia kernels:
- Memory fences & ordered access: explicit acquire/release semantics via @fence and @access for correct multi-threaded synchronization
- Warp-level operations: shuffles (@shfl), inclusive scans (@warpreduce) and reductions (@warpfold) for efficient intra-warp communication
- Vectorized memory transactions: hardware-accelerated vector loads and stores (vload, vstore!) for higher memory bandwidth
The current implementation targets CUDA GPUs via PTX instructions, but the macro-based API is designed for portability to other backends.
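The warp-level operations listed above follow a well-known communication pattern, which can be illustrated on the CPU. The sketch below emulates a 32-lane warp with an ordinary array; it does not call KernelIntrinsics.jl, and the names `shfl_down`, `shfl_up`, `lane_fold` and `lane_scan` are hypothetical helpers, not the package's API. On a real GPU each array slot would be a register held by one lane and each shuffle a single hardware instruction.

```julia
const WARP_SIZE = 32

# Emulated shuffle-down: lane i reads the value held by lane i + delta.
# Lanes whose source falls outside the warp keep their own value.
shfl_down(vals, delta) =
    [i + delta <= length(vals) ? vals[i + delta] : vals[i] for i in 1:length(vals)]

# Emulated shuffle-up: lane i reads the value held by lane i - delta.
shfl_up(vals, delta) =
    [i - delta >= 1 ? vals[i - delta] : vals[i] for i in 1:length(vals)]

# Tree reduction: the stride halves each step, so 32 lanes need 5 steps.
# Only lane 1 is guaranteed to hold the full result at the end.
# Assumes op is associative and commutative, and a power-of-two lane count.
function lane_fold(op, vals)
    delta = length(vals) ÷ 2
    while delta >= 1
        vals = map(op, vals, shfl_down(vals, delta))
        delta ÷= 2
    end
    return vals[1]
end

# Inclusive scan (Hillis-Steele style): the stride doubles each step.
# Only lanes that received a value from a lower lane combine it.
function lane_scan(op, vals)
    delta = 1
    while delta < length(vals)
        shifted = shfl_up(vals, delta)
        vals = [i > delta ? op(shifted[i], vals[i]) : vals[i] for i in 1:length(vals)]
        delta *= 2
    end
    return vals
end

lanes = collect(1:WARP_SIZE)
fold_result = lane_fold(+, lanes)   # equals sum(lanes)
scan_result = lane_scan(+, lanes)   # equals cumsum(lanes)
```

Both loops finish in log2(32) = 5 steps, which is why shuffles are attractive for intra-warp communication: no shared memory or block-level barrier is involved.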
CPU mapreduce
A smaller side experiment on Julia’s CPU mapreduce: a tree-based version runs modestly faster than Base while keeping floating-point precision under control.
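For readers unfamiliar with the idea, a tree-based (pairwise) mapreduce can be sketched as follows. This is a generic textbook version, not the implementation from the experiment; the function name `tree_mapreduce` and the `blocksize` parameter are assumptions made for illustration. The range is split in half recursively, and small blocks are reduced with a plain loop.

```julia
# Hypothetical sketch of a pairwise (tree-based) mapreduce.
# Assumes op is associative; xs must be non-empty.
function tree_mapreduce(f, op, xs::AbstractVector,
                        lo::Int = firstindex(xs), hi::Int = lastindex(xs);
                        blocksize::Int = 64)
    lo <= hi || throw(ArgumentError("cannot reduce an empty range"))
    if hi - lo < blocksize
        # Small block: plain left-to-right loop.
        acc = f(xs[lo])
        for i in lo+1:hi
            acc = op(acc, f(xs[i]))
        end
        return acc
    end
    mid = lo + (hi - lo) ÷ 2   # midpoint of the current range
    left = tree_mapreduce(f, op, xs, lo, mid; blocksize)
    right = tree_mapreduce(f, op, xs, mid + 1, hi; blocksize)
    return op(left, right)
end

# Precision check in Float32 against a Float64 reference.
xs = fill(0.1f0, 1_000_000)
exact = 1_000_000 * Float64(0.1f0)
naive = foldl(+, xs)                    # strict left-to-right accumulation
tree = tree_mapreduce(identity, +, xs)  # pairwise accumulation

naive_err = abs(Float64(naive) - exact)
tree_err = abs(Float64(tree) - exact)
```

The precision benefit comes from the shape of the reduction: each element passes through roughly log2(n) additions instead of up to n, so rounding errors accumulate far more slowly than in a strict left-to-right sum.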