Compiler
Execution
The main entry-point to the compiler is the @cuda macro:
CUDA.@cuda — Macro

@cuda [kwargs...] func(args...)

High-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.
Several keyword arguments are supported that influence the behavior of @cuda.
- dynamic: use dynamic parallelism to launch device-side kernels
- arguments that influence kernel compilation: see cufunction and dynamic_cufunction
- arguments that influence kernel launch: see CUDA.HostKernel and CUDA.DeviceKernel
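For example, a minimal sketch of a @cuda launch (vadd! is a hypothetical kernel; any callable returning nothing will do):

```julia
using CUDA

# hypothetical kernel: element-wise vector addition; kernels must return `nothing`
function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(1024)
b = CUDA.rand(1024)
c = CUDA.zeros(1024)

# compiled on first use, then launched with 4 blocks of 256 threads
@cuda threads=256 blocks=cld(length(c), 256) vadd!(c, a, b)
```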
The underlying operations (argument conversion, kernel compilation, kernel call) can be performed explicitly when more control is needed, e.g. to reflect on the resource usage of a kernel to determine the launch configuration. A host-side kernel launch is done as follows:
```julia
args = ...
GC.@preserve args begin
    kernel_args = cudaconvert.(args)
    kernel_tt = Tuple{Core.Typeof.(kernel_args)...}
    kernel = cufunction(f, kernel_tt; compilation_kwargs)
    kernel(kernel_args...; launch_kwargs)
end
```

A device-side launch, aka. dynamic parallelism, is similar but more restricted:
```julia
args = ...
# GC.@preserve is not supported
# we're on the device already, so no need to cudaconvert
kernel_tt = Tuple{Core.Typeof(args[1]), ...}    # this needs to be fully inferred!
kernel = dynamic_cufunction(f, kernel_tt)       # no compiler kwargs supported
kernel(args...; launch_kwargs)
```

If needed, you can use a lower-level API that lets you inspect the compiled kernel:
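As a concrete sketch of the host-side sequence above (reusing the hypothetical vadd! kernel from before, and deriving the launch configuration from the CUDA.maxthreads query documented below):

```julia
using CUDA

# hypothetical kernel, as before
function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(4096); b = CUDA.rand(4096); c = similar(a)

args = (c, a, b)
GC.@preserve args begin
    kernel_args = cudaconvert.(args)
    kernel_tt = Tuple{Core.Typeof.(kernel_args)...}
    kernel = cufunction(vadd!, kernel_tt)

    # reflect on the compiled kernel to pick a launch configuration
    threads = min(length(c), CUDA.maxthreads(kernel))
    blocks = cld(length(c), threads)

    kernel(kernel_args...; threads=threads, blocks=blocks)
end
```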
CUDA.cudaconvert — Function

cudaconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.
Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the CUDA.Adaptor type.
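For instance, to make a custom wrapper type passable to kernels, extend Adapt.jl rather than cudaconvert itself. A minimal sketch, using a hypothetical ScaledArray type:

```julia
using CUDA, Adapt

# hypothetical wrapper combining an array with a scalar parameter
struct ScaledArray{A,T}
    data::A
    scale::T
end

# recurse into the wrapped array: CUDA.jl applies its adaptor to every kernel
# argument, so the CuArray inside is converted to a device-compatible array
Adapt.adapt_structure(to, x::ScaledArray) = ScaledArray(adapt(to, x.data), x.scale)
```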
CUDA.cufunction — Function

cufunction(f, tt=Tuple{}; kwargs...)

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.
The following keyword arguments are supported:
- minthreads: the required number of threads in a thread block
- maxthreads: the maximum number of threads in a thread block
- blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
- maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
- name: override the name that the kernel will have in the generated code
The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.
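A sketch of passing compilation keywords (this builds on the hypothetical vadd! kernel and CuArrays from the sketches above; the limits and name are arbitrary):

```julia
# vadd! and the CuArrays c, a, b as defined in the sketches above
tt = Tuple{Core.Typeof.(cudaconvert.((c, a, b)))...}

# repeated calls with the same function, types and keywords hit the compilation cache
kernel = cufunction(vadd!, tt; maxthreads=256, name="vadd_f32")
```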
CUDA.HostKernel — Type

(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.
The following keyword arguments are supported:
- threads (defaults to 1)
- blocks (defaults to 1)
- shmem (defaults to 0)
- config: callback function to dynamically compute the launch configuration. Should accept a HostKernel and return a named tuple with any of the above as fields. This functionality is intended to be used in combination with the CUDA occupancy API.
- stream (defaults to the default stream)
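A sketch of the config callback in combination with the occupancy API (reusing the hypothetical vadd! kernel, and assuming the kernel's underlying CuFunction is accessible as k.fun, as commonly done with launch_configuration):

```julia
using CUDA

# hypothetical kernel, as before
function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

c = CUDA.zeros(2^16); a = CUDA.rand(2^16); b = CUDA.rand(2^16)
kernel_args = cudaconvert.((c, a, b))
kernel = cufunction(vadd!, Tuple{Core.Typeof.(kernel_args)...})

# compute threads and blocks at launch time from the kernel's occupancy
kernel(kernel_args...; config = k -> begin
    cfg = launch_configuration(k.fun)
    threads = min(length(c), cfg.threads)
    (threads = threads, blocks = cld(length(c), threads))
end)
```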
CUDA.version — Function

version()

Returns the CUDA version as reported by the driver.

version(k::HostKernel)

Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.
CUDA.maxthreads — Function

maxthreads(k::HostKernel)

Queries the maximum number of threads a kernel can use in a single block.
CUDA.registers — Function

registers(k::HostKernel)

Queries the register usage of a kernel.
CUDA.memory — Function

memory(k::HostKernel)

Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.
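Together, these queries make it easy to inspect a compiled kernel, e.g. (continuing with the hypothetical vadd! kernel and kernel_tt type tuple from the sketches above):

```julia
kernel = cufunction(vadd!, kernel_tt)

CUDA.version(kernel)     # named tuple with the PTX and SM versions it was compiled for
CUDA.maxthreads(kernel)  # maximum number of threads per block
CUDA.registers(kernel)   # registers used per thread
CUDA.memory(kernel)      # named tuple with local, shared and constant memory usage in bytes
```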
Reflection
If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:
@device_code_lowered
@device_code_typed
@device_code_warntype
@device_code_llvm
@device_code_ptx
@device_code_sass
@device_code

These macros are also available in function-form:
CUDA.code_typed
CUDA.code_warntype
CUDA.code_llvm
CUDA.code_ptx
CUDA.code_sass

For more information, please consult the GPUCompiler.jl documentation. Only the code_sass functionality is actually defined in CUDA.jl:
CUDA.@device_code_sass — Macro

@device_code_sass [io::IO=stdout, ...] ex

Evaluates the expression ex and prints the result of CUDA.code_sass to io for every compiled CUDA kernel. For other supported keywords, see CUDA.code_sass.
CUDA.code_sass — Function

code_sass([io], f, types, cap::VersionNumber)

Prints the SASS code generated for the method matching the given generic function and type signature to io, which defaults to stdout.
The following keyword arguments are supported:
- cap: which device to generate code for
- kernel: treat the function as an entry-point kernel
- verbose: enable verbose mode, which displays code generation statistics
See also: @device_code_sass
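For example, a sketch using a trivial hypothetical kernel; the same pattern applies to the other reflection macros and functions listed above, such as @device_code_ptx and CUDA.code_llvm:

```julia
using CUDA

dummy_kernel() = nothing   # hypothetical, trivial kernel

# function form: prints the SASS disassembly to stdout
CUDA.code_sass(dummy_kernel, Tuple{})

# macro form: prints the SASS of every kernel compiled while evaluating the expression
CUDA.@device_code_sass @cuda dummy_kernel()
```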