Compiler
Execution
The main entry-point to the compiler is the @cuda macro:
CUDA.@cuda — Macro

@cuda [kwargs...] func(args...)
High-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.

Several keyword arguments are supported that influence the behavior of @cuda:
- launch: whether to launch this kernel, defaults to true. If false, the returned kernel object should be launched by calling it and passing arguments again.
- dynamic: use dynamic parallelism to launch device-side kernels, defaults to false.
- arguments that influence kernel compilation: see cufunction and dynamic_cufunction
- arguments that influence kernel launch: see CUDA.HostKernel and CUDA.DeviceKernel
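As an illustration, here is a minimal sketch of launching a simple element-wise kernel with @cuda (the kernel name vadd! and the launch configuration are arbitrary choices made for this example):

```julia
using CUDA

# Illustrative kernel: element-wise vector addition; kernels must return nothing.
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(1024)
b = CUDA.rand(1024)
c = similar(a)

# The CuArray arguments are converted to device-side counterparts via cudaconvert.
@cuda threads=256 blocks=4 vadd!(c, a, b)
```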
If needed, you can use a lower-level API that lets you inspect the compiled kernel:
CUDA.cudaconvert — Function

cudaconvert(x)
This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.

Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the CUDA.KernelAdaptor type.
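For example, a hypothetical wrapper type can be made kernel-convertible by extending Adapt.jl. The Interpolator type below is made up for this sketch; only the Adapt.adapt_structure pattern is the point:

```julia
using CUDA, Adapt

# Hypothetical wrapper around an array plus some scalar state.
struct Interpolator{A}
    knots::A
    scale::Float64
end

# Instead of adding methods to cudaconvert, extend Adapt.jl; the
# CUDA.KernelAdaptor will then convert the inner array (e.g. CuArray to
# CuDeviceArray) whenever an Interpolator is passed to a kernel.
Adapt.adapt_structure(to, itp::Interpolator) =
    Interpolator(adapt(to, itp.knots), itp.scale)
```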
CUDA.cufunction — Function

cufunction(f, tt=Tuple{}; kwargs...)

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.
The following keyword arguments are supported:
minthreads
: the required number of threads in a thread blockmaxthreads
: the maximum number of threads in a thread blockblocks_per_sm
: a minimum number of thread blocks to be scheduled on a single multiprocessormaxregs
: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)name
: override the name that the kernel will have in the generated codealways_inline
: inline all function calls in the kernelfastmath
: use less precise square roots and flush denormalscap
andptx
: to override the compute capability and PTX version to compile for
The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.
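A sketch of the lower-level workflow this enables, manually converting arguments and compiling before launching (the scale! kernel is illustrative):

```julia
using CUDA

# Illustrative kernel that scales a vector in place.
function scale!(x, α)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] *= α
    end
    return nothing
end

x = CUDA.ones(256)

# Convert arguments to their GPU-side representation and compile; the result
# of cufunction is cached, so doing this in a hot path is cheap.
kernel_args = cudaconvert.((x, 2f0))
kernel_tt = Tuple{map(Core.Typeof, kernel_args)...}
kernel = cufunction(scale!, kernel_tt)

# Launch the compiled kernel object directly.
kernel(kernel_args...; threads=256, blocks=1)
```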
CUDA.HostKernel — Type

(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).
The following keyword arguments are supported:

- threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
- blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
- shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
- stream (default: stream()): CuStream to launch the kernel on.
- cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care wrt. the number of blocks launched.
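A common pattern, sketched below, is to compile with @cuda launch=false and then use the occupancy API to pick a launch configuration before calling the resulting HostKernel:

```julia
using CUDA

function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.rand(2^20); b = CUDA.rand(2^20); c = similar(a)

# Compile without launching, then query a suitable configuration.
kernel = @cuda launch=false vadd!(c, a, b)
config = launch_configuration(kernel.fun)
threads = min(length(c), config.threads)
blocks = cld(length(c), threads)

# Call the HostKernel object with the chosen configuration.
kernel(c, a, b; threads, blocks)
```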
CUDA.version — Function

version(k::HostKernel)

Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.
CUDA.maxthreads — Function

maxthreads(k::HostKernel)

Queries the maximum number of threads a kernel can use in a single block.
CUDA.registers — Function

registers(k::HostKernel)

Queries the register usage of a kernel.
CUDA.memory — Function

memory(k::HostKernel)

Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.
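For instance, a sketch of querying these properties on a kernel compiled with @cuda launch=false (the trivial kernel is only for illustration; the reported values depend on your GPU and toolkit):

```julia
using CUDA

dummy_kernel() = nothing  # trivial kernel for illustration

k = @cuda launch=false dummy_kernel()

CUDA.version(k)      # PTX and SM versions the kernel was compiled for (named tuple)
CUDA.maxthreads(k)   # maximum threads per block for this kernel
CUDA.registers(k)    # registers used per thread
CUDA.memory(k)       # local, shared and constant memory usage in bytes (named tuple)
```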
Reflection
If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:
@device_code_lowered
@device_code_typed
@device_code_warntype
@device_code_llvm
@device_code_ptx
@device_code_sass
@device_code
These macros are also available in function-form:
CUDA.code_typed
CUDA.code_warntype
CUDA.code_llvm
CUDA.code_ptx
CUDA.code_sass
For more information, please consult the GPUCompiler.jl documentation. Only the code_sass functionality is actually defined in CUDA.jl:
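For example, a minimal sketch of inspecting generated code with one of these reflection tools (output will vary with your GPU and toolkit):

```julia
using CUDA

dummy_kernel() = nothing  # trivial kernel for illustration

# Prints the generated PTX for every kernel launched while evaluating the expression.
@device_code_ptx @cuda dummy_kernel()

# The function form targets a specific kernel signature instead.
CUDA.code_ptx(dummy_kernel, Tuple{})
```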
CUDA.@device_code_sass — Macro

@device_code_sass [io::IO=stdout, ...] ex

Evaluates the expression ex and prints the result of CUDA.code_sass to io for every executed CUDA kernel. For other supported keywords, see CUDA.code_sass.
CUDA.code_sass — Function

code_sass([io], f, types; raw=false)
code_sass(f, [io]; raw=false)

Prints the SASS code corresponding to one or more CUDA modules to io, which defaults to stdout.

If providing both f and types, it is assumed that this uniquely identifies a kernel function, for which SASS code will be generated, and printed to io.

If only providing a callable function f, typically specified using the do syntax, the SASS code for all modules executed during evaluation of f will be printed. This can be convenient to display the SASS code for functions whose source code is not available.
The following keyword arguments are supported:

- raw: dump the assembly like nvdisasm reports it, without post-processing;
- in the case of specifying f and types: all keyword arguments from cufunction
See also: @device_code_sass
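A sketch of both calling forms (SASS generation requires a functional GPU and nvdisasm; the dummy kernel is illustrative):

```julia
using CUDA

dummy_kernel() = nothing  # trivial kernel for illustration

# Specifying f and types: print the SASS for this particular kernel signature.
CUDA.code_sass(dummy_kernel, Tuple{})

# Specifying only a callable, via do syntax: print the SASS of all modules
# executed while running the block, even if their source is not at hand.
CUDA.code_sass() do
    @cuda dummy_kernel()
end
```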