Software

While my primary research is in statistics, I build high-performance computing tools in Julia, targeting both GPU and CPU. Much of this work centers on parallel primitives — reductions, scans and mapreduce — that underpin everything from dot products to matrix–vector multiplication over arbitrary vector spaces.

🔥 KernelForge.jl: Portable, high-performance GPU primitives (mapreduce, accumulate) aiming to match vendor libraries like NVIDIA's CUB/cuBLAS while staying cross-platform. (Julia, GPU; active; 18 stars)
⚙️ KernelIntrinsics.jl: Low-level GPU building blocks for KernelAbstractions.jl: memory fences, warp-level shuffles/reductions, and vectorized loads & stores. (Julia, CUDA / PTX; active; 10 stars)
🖥️ CPU mapreduce: A small side experiment: a tree-based mapreduce for the CPU that edges past Julia Base while keeping good floating-point precision. (Julia, CPU)

KernelForge.jl

The goal of KernelForge.jl is to provide efficient, portable implementations of parallel primitives that match the performance of vendor-optimized libraries like NVIDIA’s CUB or cuBLAS, while staying cross-platform.

This is made possible by Julia’s design. Users can define their own methods for operators such as + and *, which enables reductions and matrix–vector multiplication over arbitrary vector spaces, provided the element type is concrete and fixed-size (for example QuaternionF32 or NTuple{2, Float64}). On top of that, KernelAbstractions.jl lets a single @kernel compile to different low-level targets (PTX for NVIDIA, SPIR-V for Intel, or GCN for AMD) at minimal development cost.
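To make the idea of reductions over a custom vector space concrete, here is a minimal CPU-only sketch using nothing but Julia Base. The type `Vec2` and the helper `matvec` are hypothetical names introduced for illustration; they are not part of KernelForge.jl, and the package's own API may look different.

```julia
# Hypothetical fixed-size element type: a 2-component vector.
# (A dedicated struct avoids adding methods to Base's own tuple types.)
struct Vec2
    x::Float64
    y::Float64
end

# The vector-space structure: addition and scalar multiplication.
Base.:+(a::Vec2, b::Vec2) = Vec2(a.x + b.x, a.y + b.y)
Base.:*(s::Real, v::Vec2) = Vec2(s * v.x, s * v.y)

# The type is plain bits, so it can live in dense arrays.
@assert isbitstype(Vec2)

vs = [Vec2(i, 2i) for i in 1:4]

# Reduction with the custom +.
total = reduce(+, vs)                 # Vec2(10.0, 20.0)

# Dot-product-like mapreduce: scalar weights times vectors, then summed.
w = [1.0, 0.5, 0.25, 0.125]
weighted = mapreduce(*, +, w, vs)     # Vec2(3.25, 6.5)

# Matrix-vector product as one mapreduce per row (hypothetical helper).
matvec(A, v) = [mapreduce(*, +, view(A, i, :), v) for i in axes(A, 1)]

A = [1.0 1.0 1.0  1.0;
     1.0 0.5 0.25 0.125]
Av = matvec(A, vs)                    # [Vec2(10.0, 20.0), Vec2(3.25, 6.5)]
```

The same generic code works for any element type that defines `+` and scalar `*`, which is what a generic GPU mapreduce can specialize on.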

Live benchmarks

KernelForge against vendor libraries (CUB / rocPRIM), KernelAbstractions (AK) and CUDA / Base — pulled live from the benchmark repo. Pick a GPU, operation, element type and metric; hover a point for the exact number.

KernelIntrinsics.jl

KernelIntrinsics.jl is the lower-level package KernelForge builds on, giving fine-grained control over memory ordering, synchronization and vectorized access from within Julia kernels:

  • Memory fences & ordered access — explicit acquire/release semantics via @fence and @access for correct multi-threaded synchronization
  • Warp-level operations — shuffles (@shfl), inclusive scans (@warpreduce) and reductions (@warpfold) for efficient intra-warp communication
  • Vectorized memory transactions — hardware-accelerated vector loads and stores (vload, vstore!) for higher memory bandwidth

The current implementation targets CUDA GPUs via PTX instructions, but the macro-based API is designed for portability to other backends.
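For readers unfamiliar with warp-level communication, the following is a plain-Julia CPU simulation of the shuffle-based reduction and inclusive scan patterns mentioned above. It does not use KernelIntrinsics.jl or its macros: the warp is modelled as an ordinary vector of 32 lanes, and all function names (`shfl_down`, `shfl_up`, `simulated_warp_reduction`, `simulated_warp_scan`) are hypothetical stand-ins chosen for this sketch.

```julia
const WARP_SIZE = 32  # assumed warp width (as on NVIDIA GPUs)

# Simulated shuffles: lane i reads the value of lane i + offset (down)
# or lane i - offset (up). Out-of-range lanes keep their own value.
shfl_down(vals, offset) =
    [i + offset <= length(vals) ? vals[i + offset] : vals[i] for i in eachindex(vals)]
shfl_up(vals, offset) =
    [i - offset >= 1 ? vals[i - offset] : vals[i] for i in eachindex(vals)]

# Tree reduction in log2(WARP_SIZE) shuffle steps; lane 1 ends with the result.
function simulated_warp_reduction(op, vals::AbstractVector)
    length(vals) == WARP_SIZE || throw(ArgumentError("expected $WARP_SIZE lanes"))
    offset = WARP_SIZE ÷ 2
    while offset > 0
        vals = map(op, vals, shfl_down(vals, offset))
        offset ÷= 2
    end
    return vals[1]
end

# Inclusive scan (Hillis-Steele style): lane i ends with op over lanes 1..i.
function simulated_warp_scan(op, vals::AbstractVector)
    length(vals) == WARP_SIZE || throw(ArgumentError("expected $WARP_SIZE lanes"))
    offset = 1
    while offset < WARP_SIZE
        shifted = shfl_up(vals, offset)
        # Only lanes with a valid source lane combine; the others are unchanged.
        vals = [i > offset ? op(shifted[i], vals[i]) : vals[i] for i in 1:WARP_SIZE]
        offset *= 2
    end
    return vals
end

lanes = collect(1:WARP_SIZE)
simulated_warp_reduction(+, lanes)   # 528
simulated_warp_scan(+, lanes)        # same values as cumsum(lanes)
```

Both routines take five shuffle steps for 32 lanes, which is why these patterns are attractive on real hardware: no shared memory or block-level synchronization is needed within a warp.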

CPU mapreduce

A smaller side experiment: I looked at Julia’s mapreduce on the CPU. A tree-based version lands a modest improvement over Base while keeping precision under control.
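As an illustration of what a tree-based mapreduce looks like, here is a minimal recursive (pairwise) sketch. It is not the implementation from the notebook: the name `tree_mapreduce` and the `base_size` cutoff are assumptions made for this example, and no performance claim is attached to it.

```julia
# Pairwise (tree) mapreduce: split the range in half, reduce each half,
# then combine. Below `base_size` elements, fall back to a sequential loop.
function tree_mapreduce(f, op, xs::AbstractVector; base_size::Int = 64)
    isempty(xs) && throw(ArgumentError("tree_mapreduce needs a non-empty input"))
    return _tree(f, op, xs, firstindex(xs), lastindex(xs), base_size)
end

function _tree(f, op, xs, lo, hi, base_size)
    if hi - lo < base_size
        acc = f(xs[lo])
        for i in lo+1:hi
            acc = op(acc, f(xs[i]))
        end
        return acc
    end
    mid = lo + (hi - lo) ÷ 2
    left  = _tree(f, op, xs, lo, mid, base_size)
    right = _tree(f, op, xs, mid + 1, hi, base_size)
    return op(left, right)
end

tree_mapreduce(identity, +, 1:100)   # 5050
tree_mapreduce(abs2, +, 1:10)        # 385

# Precision check in Float32 against a strictly sequential left fold.
xs     = fill(0.1f0, 10^6)
exact  = 10^6 * Float64(0.1f0)       # reference computed in Float64
naive  = foldl(+, xs)                # accumulates rounding error step by step
paired = tree_mapreduce(identity, +, xs)
```

The tree shape keeps partial sums of similar magnitude together, so rounding error grows far more slowly than in the sequential fold; comparing `abs(naive - exact)` with `abs(paired - exact)` shows the difference.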

View the CPU mapreduce performance notebook →