Exposing new GPU intrinsics
This tutorial walks through how to expose a new NVPTX intrinsic in CUDA.jl, using the directed-rounding floating-point operations as a worked example.
Identifying the intrinsic
The PTX ISA documents the available instructions at PTX – Floating Point Instructions. The basic arithmetic instructions each come in four directed-rounding variants:

```
{add,sub,mul,div,fma}.rnd.{f32,f64}    rnd ∈ {.rn, .rz, .rm, .rp}
```

where `.rn` is round-to-nearest-even, `.rz` rounds toward zero, `.rm` rounds toward minus infinity, and `.rp` rounds toward plus infinity.
Most of these PTX instructions are exposed by LLVM as NVVM intrinsics. The canonical list lives in `llvm/include/llvm/IR/IntrinsicsNVVM.td`. The naming convention there maps `.f32` to `.f` and `.f64` to `.d`, so the round-toward-plus-infinity FMA for Float64 is `llvm.nvvm.fma.rp.d`.
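To make the mapping concrete, here is a small hypothetical helper (not part of CUDA.jl or LLVM; the name `nvvm_name` is our own) that assembles an intrinsic name from an operation, rounding mode, and element type following that convention:

```julia
# Hypothetical helper: build the NVVM intrinsic name for a directed-rounding
# arithmetic operation, using the .f32 → .f / .f64 → .d suffix convention
# from IntrinsicsNVVM.td.
function nvvm_name(op::Symbol, mode::Symbol, ::Type{T}) where {T<:Union{Float32,Float64}}
    suffix = T === Float32 ? "f" : "d"
    return "llvm.nvvm.$op.$mode.$suffix"
end

nvvm_name(:fma, :rp, Float64)  # "llvm.nvvm.fma.rp.d"
```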
If LLVM does not expose an intrinsic for the PTX instruction you need, the fallback is to drop down to inline PTX assembly via `@asmcall`.
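As a sketch of that fallback — assuming the `@asmcall` macro from LLVM.jl's `LLVM.Interop` module, whose exact signature may differ across versions — a round-to-nearest-even Float32 add could be written roughly as:

```julia
using LLVM.Interop: @asmcall

# Sketch only: emit the PTX add.rn.f32 instruction directly as inline
# assembly. "=f,f,f" constrains the result and both inputs to .f32 registers.
add_rn(x::Float32, y::Float32) =
    @asmcall("add.rn.f32 \$0, \$1, \$2;", "=f,f,f", Float32,
             Tuple{Float32,Float32}, x, y)
```

Like the intrinsic-based definitions, this only compiles to something meaningful inside GPU kernel code.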
Calling the intrinsic from Julia
CUDA.jl invokes LLVM intrinsics through `ccall` with the `llvmcall` calling convention. For example, the round-toward-plus-infinity FMA on Float64:
```julia
fma_rp(x::Float64, y::Float64, z::Float64) =
    ccall("llvm.nvvm.fma.rp.d", llvmcall, Cdouble, (Cdouble, Cdouble, Cdouble), x, y, z)
```
```
fma_rp (generic function with 1 method)
```
We can verify that the generated PTX contains the expected instruction:
```julia
using CUDA

function fma_kernel(out, x, y, z)
    out[] = fma_rp(x, y, z)
    return
end

@device_code_ptx @cuda launch=false fma_kernel(CuArray{Float64}(undef, 1), 1.0, 1.0, 1.0)
```
```
// PTX CompilerJob of MethodInstance for Main.fma_kernel(::CuDeviceVector{Float64, 1}, ::Float64, ::Float64, ::Float64) for sm_80

//
// Generated by LLVM NVPTX Back-End
//

.version 9.2
.target sm_80
.address_size 64

// .globl _Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0_ // -- Begin function _Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0_
.extern .func julia_throw_boundserror_31476
(
    .param .align 8 .b8 julia_throw_boundserror_31476_param_0[16]
)
.noreturn;
// @_Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0_
.visible .entry _Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0_(
    .param .align 8 .b8 _Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_0[16],
    .param .align 8 .b8 _Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_1[32],
    .param .f64 _Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_2,
    .param .f64 _Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_3,
    .param .f64 _Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_4
)
{
    .reg .pred %p<2>;
    .reg .b32 %r<2>;
    .reg .b64 %rd<6>;
    .reg .f64 %fd<5>;

// %bb.0:                               // %conversion
    ld.param.u64 %rd5, [_Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_1+24];
    setp.gt.s64 %p1, %rd5, 0;
    @%p1 bra $L__BB0_2;
    bra.uni $L__BB0_1;
$L__BB0_2:                              // %L10
    ld.param.u64 %rd2, [_Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_1];
    ld.param.f64 %fd3, [_Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_4];
    ld.param.f64 %fd2, [_Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_3];
    ld.param.f64 %fd1, [_Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_2];
    fma.rp.f64 %fd4, %fd1, %fd2, %fd3;
    st.global.f64 [%rd2], %fd4;
    ret;
$L__BB0_1:                              // %L8
    ld.param.u32 %r1, [_Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_0+8];
    ld.param.u64 %rd1, [_Z10fma_kernel13CuDeviceArrayI7Float64Li1ELi1EES0_S0_S0__param_0];
    { // callseq 180, 0
    .reg .b32 temp_param_reg;
    .param .align 8 .b8 param0[16];
    st.param.b64 [param0+0], %rd1;
    st.param.b32 [param0+8], %r1;
    call.uni
    julia_throw_boundserror_31476,
    (
    param0
    );
    } // callseq 180
    // begin inline asm
    exit;
    // end inline asm
    // -- End function
}
```
Look for `fma.rp.f64` in the listing above.
Exposing the full family
Once the intrinsic for one rounding mode is wired up, the remaining three follow the same pattern. CUDA.jl exposes the full `{add,sub,mul,div,fma}_{rn,rz,rm,rp}` family for both Float32 and Float64. The names mirror CUDA C's `__fadd_rn`/`__dadd_rn`/etc.:
```julia
function rounding_demo!(out, x, y, z)
    out[1] = CUDA.fma_rn(x, y, z)
    out[2] = CUDA.fma_rz(x, y, z)
    out[3] = CUDA.fma_rp(x, y, z)
    out[4] = CUDA.fma_rm(x, y, z)
    return
end

out_d = CUDA.zeros(Float64, 4)
@cuda threads=1 rounding_demo!(out_d, 1.0, 1.0, 2.0^-53)
Array(out_d)
```
```
4-element Vector{Float64}:
 1.0
 1.0
 1.0000000000000002
 1.0
```
The exact value of 1·1 + 2⁻⁵³ lies exactly halfway between 1.0 and nextfloat(1.0), so the four entries above are 1.0 (RN, ties to even), 1.0 (RZ), nextfloat(1.0) (RP), and 1.0 (RM).
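Purely as an illustration of the pattern — this is not CUDA.jl's actual source — the whole `fma` family can be generated with an `@eval` loop over rounding modes and element types, reusing the intrinsic naming convention from earlier:

```julia
# Illustrative sketch: generate fma_{rn,rz,rm,rp} for Float32 and Float64.
# The intrinsic name uses the .f32 → .f / .f64 → .d suffix convention.
for mode in (:rn, :rz, :rm, :rp),
    (T, suffix, C) in ((Float32, "f", :Cfloat), (Float64, "d", :Cdouble))

    intr = "llvm.nvvm.fma.$mode.$suffix"
    name = Symbol(:fma_, mode)
    @eval $name(x::$T, y::$T, z::$T) =
        ccall($intr, llvmcall, $C, ($C, $C, $C), x, y, z)
end
```

Each generated function has two methods (one per element type), dispatching to the matching intrinsic; the bodies only compile when called from GPU code.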
We deliberately don't extend `Base.fma` with a `RoundingMode` argument: Julia's Base has no precedent for per-call rounding on hardware floats (`BigFloat` reads a thread-local mode set via `setrounding`), so such an API would exist only in CUDA.jl and code relying on it would not be portable. Code that needs directed rounding should call the `*_rn`/`*_rz`/`*_rm`/`*_rp` functions directly.
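For comparison, the thread-local `BigFloat` mechanism mentioned above looks like this: the rounding mode is ambient state scoped around a block of code, not a per-call argument.

```julia
# BigFloat rounding is thread-local state, scoped with setrounding:
lo = setrounding(BigFloat, RoundDown) do
    BigFloat(1) / 3   # rounded toward -Inf
end
hi = setrounding(BigFloat, RoundUp) do
    BigFloat(1) / 3   # rounded toward +Inf
end
lo < hi   # true: the two results bracket the exact value 1/3
```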
This page was generated using Literate.jl.