Compiler

Execution

The main entry-point to the compiler is the @cuda macro:

CUDA.@cuda (Macro)
@cuda [kwargs...] func(args...)

High-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.

Several keyword arguments are supported that influence the behavior of @cuda.

  • launch: whether to launch this kernel, defaults to true. If false the returned kernel object should be launched by calling it and passing arguments again.
  • dynamic: use dynamic parallelism to launch device-side kernels, defaults to false.
  • arguments that influence kernel compilation: see cufunction and dynamic_cufunction
  • arguments that influence kernel launch: see CUDA.HostKernel and CUDA.DeviceKernel
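
For example, a minimal kernel launch might look like this (vadd! is just an illustrative kernel, not part of the API):

using CUDA

# An element-wise addition kernel; kernels must return nothing.
function vadd!(c, a, b)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(1024)
b = CUDA.rand(1024)
c = similar(a)

# The CuArray arguments are converted to device arrays automatically.
@cuda threads=256 blocks=4 vadd!(c, a, b)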

If needed, you can use a lower-level API that lets you inspect the compiled kernel:

CUDA.cudaconvert (Function)
cudaconvert(x)

This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the CUDA.KernelAdaptor type.

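For example, a custom wrapper type can be made usable as a kernel argument by teaching Adapt.jl how to recurse into it (the Interpolate type below is purely illustrative); conversions of leaf types would typically add Adapt.adapt_storage methods for CUDA.KernelAdaptor instead.

using CUDA, Adapt

# Illustrative wrapper combining an array with a scalar parameter.
struct Interpolate{A}
    xs::A
    scale::Float32
end

# Recurse into the fields so that the CPU array is replaced by its device-side
# counterpart whenever an Interpolate is passed to a kernel.
Adapt.adapt_structure(to, itp::Interpolate) =
    Interpolate(adapt(to, itp.xs), itp.scale)
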
CUDA.cufunction (Function)
cufunction(f, tt=Tuple{}; kwargs...)

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.

The following keyword arguments are supported:

  • minthreads: the required number of threads in a thread block
  • maxthreads: the maximum number of threads in a thread block
  • blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
  • maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
  • name: override the name that the kernel will have in the generated code
  • always_inline: inline all function calls in the kernel
  • fastmath: use less precise square roots and flush denormals
  • cap and ptx: to override the compute capability and PTX version to compile for

The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.

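This is, in spirit, what @cuda does under the hood. A rough sketch of the manual workflow (not the exact macro expansion; memset_kernel! is just an illustrative kernel):

using CUDA

function memset_kernel!(a, val)
    @inbounds a[threadIdx().x] = val
    return nothing
end

a = CUDA.zeros(Float32, 256)

# Convert the arguments to GPU-compatible counterparts, compile a kernel for
# the resulting types (cached after the first call), and launch it explicitly.
args = map(cudaconvert, (a, 1f0))
tt = Tuple{map(Core.Typeof, args)...}
kernel = cufunction(memset_kernel!, tt; maxthreads=256)
kernel(args...; threads=256)
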
CUDA.HostKernel (Type)
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).

The following keyword arguments are supported:

  • threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
  • blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
  • shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
  • stream (default: stream()): CuStream to launch the kernel on.
  • cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care wrt. the number of blocks launched.
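
For example, combining @cuda launch=false with the occupancy API yields a kernel object that can be queried and then launched with a hand-picked configuration (gpu_fill! is just an illustrative kernel):

using CUDA

function gpu_fill!(a, val)
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(a)
        @inbounds a[i] = val
    end
    return nothing
end

a = CUDA.zeros(Float32, 10_000)

# Compile without launching, query a suitable launch configuration,
# then call the kernel object with explicit threads and blocks.
kernel = @cuda launch=false gpu_fill!(a, 1f0)
config = launch_configuration(kernel.fun)
threads = min(length(a), config.threads)
blocks = cld(length(a), threads)
kernel(a, 1f0; threads, blocks)
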
CUDA.version (Function)
version(k::HostKernel)

Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.

CUDA.maxthreads (Function)
maxthreads(k::HostKernel)

Queries the maximum number of threads a kernel can use in a single block.

CUDA.memory (Function)
memory(k::HostKernel)

Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.

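For example, on a compiled kernel object these queries can be used as follows (do_nothing is just a placeholder kernel):

using CUDA

do_nothing() = nothing

kernel = @cuda launch=false do_nothing()

CUDA.version(kernel)     # PTX and SM versions the kernel was compiled for
CUDA.maxthreads(kernel)  # maximum number of threads per block
CUDA.memory(kernel)      # local, shared and constant memory usage in bytes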

Reflection

If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:

@device_code_lowered
@device_code_typed
@device_code_warntype
@device_code_llvm
@device_code_ptx
@device_code_sass
@device_code
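
For example, to see the PTX generated for a kernel launch, prefix the launch expression with one of these macros (trivial_kernel is just a placeholder):

using CUDA

trivial_kernel() = nothing

# Print the PTX assembly of every kernel compiled while running this expression.
@device_code_ptx @cuda trivial_kernel()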

These macros are also available in function-form:

CUDA.code_typed
CUDA.code_warntype
CUDA.code_llvm
CUDA.code_ptx
CUDA.code_sass
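
These behave like their InteractiveUtils counterparts, but compile for the GPU; for example (axpy is just a placeholder function):

using CUDA

axpy(a, x, y) = a * x + y

# Inspect the device-side LLVM IR and PTX for a given argument signature.
CUDA.code_llvm(axpy, Tuple{Float32, Float32, Float32})
CUDA.code_ptx(axpy, Tuple{Float32, Float32, Float32})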

For more information, please consult the GPUCompiler.jl documentation. Only the code_sass functionality is actually defined in CUDA.jl:

CUDA.code_sass (Function)
code_sass([io], f, types; raw=false)
code_sass(f, [io]; raw=false)

Prints the SASS code corresponding to one or more CUDA modules to io, which defaults to stdout.

If providing both f and types, it is assumed that this uniquely identifies a kernel function, for which SASS code will be generated, and printed to io.

If only providing a callable function f, typically specified using the do syntax, the SASS code for all modules executed during evaluation of f will be printed. This can be convenient to display the SASS code for functions whose source code is not available.

The following keyword arguments are supported:

  • raw: dump the assembly like nvdisasm reports it, without post-processing
  • in the case of specifying f and types: all keyword arguments from cufunction

See also: @device_code_sass

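For example, both forms can be used as follows (square! and the argument types are illustrative):

using CUDA

function square!(a)
    i = threadIdx().x
    @inbounds a[i] = a[i]^2
    return nothing
end

a = CUDA.rand(Float32, 32)

# Identify the kernel by its function and device-side argument types.
CUDA.code_sass(square!, Tuple{typeof(cudaconvert(a))})

# Alternatively, run some code and dump the SASS of every module it executes.
CUDA.code_sass() do
    @cuda threads=32 square!(a)
end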