Compiler
Execution
The main entry-point to the compiler is the @cuda macro:
CUDA.@cuda — Macro
@cuda [kwargs...] func(args...)
High-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.
Several keyword arguments are supported that influence the behavior of @cuda.
- launch: whether to launch this kernel, defaults to true. If false, the returned kernel object should be launched by calling it and passing arguments again.
- dynamic: use dynamic parallelism to launch device-side kernels, defaults to false.
- arguments that influence kernel compilation: see cufunction and dynamic_cufunction
- arguments that influence kernel launch: see CUDA.HostKernel and CUDA.DeviceKernel
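As a minimal sketch of the two launch modes described above, the following assumes a working CUDA.jl installation and a CUDA-capable GPU. The kernel `vadd!` and the array sizes are hypothetical and chosen only for illustration.

```julia
using CUDA

# Hypothetical element-wise addition kernel; kernels must return nothing.
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)

# Compile (on first use) and launch in one step.
@cuda threads=256 blocks=4 vadd!(c, a, b)

# Compile only; the returned kernel object is launched by calling it
# and passing the arguments again, together with the launch keywords.
kernel = @cuda launch=false vadd!(c, a, b)
kernel(c, a, b; threads=256, blocks=4)
```

Splitting compilation from launch is useful when the launch configuration should depend on properties of the compiled kernel.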
If needed, you can use a lower-level API that lets you inspect the compiled kernel:
CUDA.cudaconvert — Function
cudaconvert(x)
This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.
Do not add methods to this function; instead, extend the underlying Adapt.jl package and register methods for the CUDA.Adaptor type.
CUDA.cufunction — Function
cufunction(f, tt=Tuple{}; kwargs...)
Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.
The following keyword arguments are supported:
- minthreads: the required number of threads in a thread block
- maxthreads: the maximum number of threads in a thread block
- blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
- maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
- name: override the name that the kernel will have in the generated code
The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.
CUDA.HostKernel — Type
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)
Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.
The following keyword arguments are supported:
- threads (defaults to 1)
- blocks (defaults to 1)
- shmem (defaults to 0)
- stream (defaults to the default stream)
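The lower-level pieces described above (cudaconvert, cufunction, and calling the resulting kernel object) can be combined by hand, roughly as in the sketch below. The kernel `scale!` and its arguments are hypothetical, and a CUDA-capable GPU is assumed. The `GC.@preserve` block is there because the converted arguments may refer to memory owned by the original host-side objects.

```julia
using CUDA

# Hypothetical kernel that scales a vector in place.
function scale!(y, α)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] *= α
    end
    return nothing
end

y = CUDA.ones(Float32, 512)
args = (y, 2f0)

GC.@preserve args begin
    # Convert every argument to its GPU-friendly representation.
    kernel_args = CUDA.cudaconvert.(args)
    # Compile for the types of the converted arguments.
    kernel_tt = Tuple{Core.Typeof.(kernel_args)...}
    kernel = CUDA.cufunction(scale!, kernel_tt; maxthreads=256)
    # Launch, passing the GPU-compatible arguments and launch keywords.
    kernel(kernel_args...; threads=256, blocks=2)
end
```

In most code the @cuda macro is the more convenient choice; doing the steps manually mainly helps when compilation and launch need to be controlled separately.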
CUDA.version — Function
version()
Returns the CUDA version as reported by the driver.
version(k::HostKernel)
Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.
CUDA.maxthreads — Function
maxthreads(k::HostKernel)
Queries the maximum number of threads a kernel can use in a single block.
CUDA.registers — Function
registers(k::HostKernel)
Queries the register usage of a kernel.
CUDA.memory — Function
memory(k::HostKernel)
Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.
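The query functions above can be applied to a kernel object obtained with `launch=false`, for example to choose a launch configuration. This is a sketch under the assumption that a CUDA-capable GPU is available; the kernel `fill_one!` is hypothetical.

```julia
using CUDA

# Hypothetical kernel that sets every element to one.
function fill_one!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] = 1f0
    end
    return nothing
end

x = CUDA.zeros(Float32, 10_000)

# Compile without launching so the kernel can be inspected first.
kernel = @cuda launch=false fill_one!(x)

@show CUDA.version(kernel)     # PTX and SM versions (named tuple)
@show CUDA.maxthreads(kernel)  # maximum threads per block for this kernel
@show CUDA.registers(kernel)   # register usage
@show CUDA.memory(kernel)      # local, shared and constant memory in bytes

# Use the queried limit to pick a launch configuration.
threads = min(length(x), CUDA.maxthreads(kernel))
blocks = cld(length(x), threads)
kernel(x; threads=threads, blocks=blocks)
```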
Reflection
If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:
@device_code_lowered
@device_code_typed
@device_code_warntype
@device_code_llvm
@device_code_ptx
@device_code_sass
@device_code
These macros are also available in function-form:
CUDA.code_typed
CUDA.code_warntype
CUDA.code_llvm
CUDA.code_ptx
CUDA.code_sass
For more information, please consult the GPUCompiler.jl documentation. Only the code_sass functionality is actually defined in CUDA.jl:
CUDA.@device_code_sass — Macro
@device_code_sass [io::IO=stdout, ...] ex
Evaluates the expression ex and prints the result of CUDA.code_sass to io for every compiled CUDA kernel. For other supported keywords, see CUDA.code_sass.
CUDA.code_sass — Function
code_sass([io], f, types, cap::VersionNumber)
Prints the SASS code generated for the method matching the given generic function and type signature to io, which defaults to stdout.
The following keyword arguments are supported:
- cap: which device to generate code for
- kernel: treat the function as an entry-point kernel
- verbose: enable verbose mode, which displays code generation statistics
See also: @device_code_sass
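As an illustration of the macro form, a reflection macro can wrap an ordinary @cuda invocation. The sketch below assumes a CUDA-capable GPU and uses a hypothetical kernel `set_one!`; the expression is still evaluated, so the kernel runs as usual while the generated code is printed to stdout.

```julia
using CUDA

# Hypothetical kernel used only to have something to inspect.
function set_one!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] = 1f0
    end
    return nothing
end

x = CUDA.zeros(Float32, 128)

# Print the PTX for kernels compiled while evaluating the expression.
@device_code_ptx @cuda threads=128 set_one!(x)
```

The other macros listed above, such as @device_code_llvm or @device_code_sass, are used the same way.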