Compiler
Execution
The main entry-point to the compiler is the @cuda macro:
CUDACore.@cuda — Macro
@cuda [kwargs...] func(args...)
High-level interface for executing code on a GPU. The @cuda macro should prefix a call, with func a callable function or object that should return nothing. It will be compiled to a CUDA function upon first use, and to a certain extent arguments will be converted and managed automatically using cudaconvert. Finally, a call to cudacall is performed, scheduling a kernel launch on the current CUDA context.
Several keyword arguments are supported that influence the behavior of @cuda.
- launch: whether to launch this kernel, defaults to true. If false, the returned kernel object should be launched by calling it and passing arguments again.
- dynamic: use dynamic parallelism to launch device-side kernels, defaults to false.
- backend: which compiler backend to use, defaults to LLVMBackend. Either an AbstractBackend instance or a module that defines DefaultBackend() (e.g. backend=CUDA resolves to CUDA.DefaultBackend()). Backend-specific compiler kwargs not recognized by @cuda itself are forwarded to kernel_compile.
- arguments that influence kernel compilation: see cufunction and dynamic_cufunction
- arguments that influence kernel launch: see CUDACore.HostKernel and CUDACore.DeviceKernel
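As a minimal sketch of the two launch modes described above, the following assumes a working CUDA.jl installation that re-exports @cuda through `using CUDA`. The kernel `vadd!` and the array sizes are hypothetical and chosen only for illustration.

```julia
using CUDA

# Hypothetical kernel: element-wise vector addition. Kernels must return nothing.
function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a = CUDA.fill(1.0f0, 1024)
b = CUDA.fill(2.0f0, 1024)
c = CUDA.zeros(Float32, 1024)

# Compile on first use and launch immediately.
@cuda threads=256 blocks=4 vadd!(c, a, b)

# Compile only; launch later by calling the kernel object and passing the arguments again.
kernel = @cuda launch=false vadd!(c, a, b)
kernel(c, a, b; threads=256, blocks=4)

synchronize()
```

With launch=false, the launch keyword arguments move from the macro to the call on the returned kernel object.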
If needed, you can use a lower-level API that lets you inspect the compiled kernel:
CUDACore.cudaconvert — Function
cudaconvert(x)
This function is called for every argument to be passed to a kernel, allowing it to be converted to a GPU-friendly format. By default, the function does nothing and returns the input object x as-is.
Do not add methods to this function, but instead extend the underlying Adapt.jl package and register methods for the CUDA.KernelAdaptor type.
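One common way to extend Adapt.jl for a custom type is sketched below. It assumes Adapt.jl is installed alongside CUDA.jl; the wrapper type `Scaled` is hypothetical. Note that `Adapt.@adapt_structure` defines the conversion for every adaptor, not only for CUDA.KernelAdaptor, which is usually sufficient for simple wrappers.

```julia
using CUDA, Adapt

# Hypothetical wrapper type, used only for illustration.
struct Scaled{A}
    data::A
    factor::Float32
end

# Teach Adapt.jl how to rebuild the wrapper around its converted fields.
Adapt.@adapt_structure Scaled

s = Scaled(CUDA.ones(Float32, 16), 2.0f0)
d = cudaconvert(s)    # the conversion applied to each kernel argument

@show typeof(s.data)  # host-side array type
@show typeof(d.data)  # device-side array type
```

No method is added to cudaconvert itself; the conversion is picked up through Adapt.jl.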
CUDACore.cufunction — Function
cufunction(f, tt=Tuple{}; kwargs...)
Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. For a higher-level interface, use @cuda.
The following keyword arguments are supported:
- minthreads: the required number of threads in a thread block
- maxthreads: the maximum number of threads in a thread block
- blocks_per_sm: a minimum number of thread blocks to be scheduled on a single multiprocessor
- maxregs: the maximum number of registers to be allocated to a single thread (only supported on LLVM 4.0+)
- name: override the name that the kernel will have in the generated code
- always_inline: inline all function calls in the kernel
- fastmath: use less precise square roots and flush denormals
- cap and ptx: to override the compute capability and PTX version to compile for
The output of this function is automatically cached, i.e. you can simply call cufunction in a hot path without degrading performance. New code will be generated automatically when the function changes, or when different types or keyword arguments are provided.
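The manual path through cudaconvert and cufunction can be sketched as follows. This is an approximation of what the default backend does on behalf of @cuda, not its exact expansion; the kernel `scale!` is hypothetical.

```julia
using CUDA

# Hypothetical kernel: scales a vector in place.
function scale!(x, α)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] *= α
    end
    return nothing
end

x = CUDA.ones(Float32, 512)
args = (x, 3.0f0)

# Keep the host objects alive while their device-side counterparts are in use.
GC.@preserve args begin
    kernel_args = map(cudaconvert, args)                 # GPU-friendly arguments
    kernel_tt = Tuple{map(Core.Typeof, kernel_args)...}  # their types, as a tuple type
    kernel = cufunction(scale!, kernel_tt; maxthreads=256)
    kernel(kernel_args...; threads=256, blocks=2)
end
synchronize()
```

Because the result of cufunction is cached, repeating this block with the same function, types and keyword arguments should not trigger a recompilation.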
CUDACore.AbstractKernel — Type
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)
Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.
A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).
The following keyword arguments are supported:
- threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
- blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
- clustersize (default: 1): Number of thread blocks to launch as a cooperative cluster, or a 1-, 2- or 3-tuple of dimensions (e.g. clustersize=(2, 2, 2) for a 3D grid). Use clusterIdx() and clusterDim() to query from within the kernel. Only supported on compute capability 9.0 and above. If clustersize=1, no clusters are launched.
- shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
- stream (default: stream()): CuStream to launch the kernel on.
- cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care with respect to the number of blocks launched.
CUDACore.HostKernel — Type
(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)
Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.
A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).
The following keyword arguments are supported:
- threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
- blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
- clustersize (default: 1): Number of thread blocks to launch as a cooperative cluster, or a 1-, 2- or 3-tuple of dimensions (e.g. clustersize=(2, 2, 2) for a 3D grid). Use clusterIdx() and clusterDim() to query from within the kernel. Only supported on compute capability 9.0 and above. If clustersize=1, no clusters are launched.
- shmem (default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
- stream (default: stream()): CuStream to launch the kernel on.
- cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care with respect to the number of blocks launched.
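To illustrate the threads and shmem launch keywords on a kernel object, here is a small sketch that reverses a vector within a single block using dynamic shared memory. The kernel `reverse_block!` and the vector length are hypothetical; the example assumes the whole vector fits in one thread block.

```julia
using CUDA

# Hypothetical kernel: reverses a vector through dynamic shared memory.
function reverse_block!(x)
    t = threadIdx().x
    n = length(x)
    tmp = CuDynamicSharedArray(Float32, n)  # backed by the bytes requested via shmem
    @inbounds tmp[t] = x[t]
    sync_threads()                          # wait until every element has been staged
    @inbounds x[t] = tmp[n - t + 1]
    return nothing
end

x = CuArray(Float32.(1:64))

kernel = @cuda launch=false reverse_block!(x)
kernel(x; threads=length(x), shmem=length(x) * sizeof(Float32))
synchronize()
```

The shmem value is given in bytes, so it is the element count multiplied by the element size.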
CUDACore.version — Function
version(k::HostKernel)
Queries the PTX and SM versions a kernel was compiled for. Returns a named tuple.
CUDACore.maxthreads — Function
maxthreads(k::HostKernel)
Queries the maximum number of threads a kernel can use in a single block.
CUDACore.registers — Function
registers(k::HostKernel)
Queries the register usage of a kernel.
CUDACore.memory — Function
memory(k::HostKernel)
Queries the local, shared and constant memory usage of a compiled kernel in bytes. Returns a named tuple.
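The query functions above can be combined with launch=false to inspect a kernel before choosing a launch configuration. The sketch below assumes these functions are reachable as CUDA.version, CUDA.maxthreads, CUDA.registers and CUDA.memory; the kernel `fill_index!` is hypothetical.

```julia
using CUDA

# Hypothetical kernel: writes each element's own index.
function fill_index!(out)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(out)
        @inbounds out[i] = i
    end
    return nothing
end

out = CUDA.zeros(Float32, 10_000)
kernel = @cuda launch=false fill_index!(out)

# Inspect the compiled kernel without launching it.
@show CUDA.version(kernel)
@show CUDA.registers(kernel)
@show CUDA.memory(kernel)

# Size the launch from the kernel's own per-block thread limit.
threads = min(length(out), CUDA.maxthreads(kernel))
blocks = cld(length(out), threads)
kernel(out; threads, blocks)
synchronize()
```

Deriving threads from maxthreads keeps the launch valid even when register pressure lowers the per-block limit below the hardware maximum.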
To plug in alternative compiler back-ends (e.g. cuTile.jl), @cuda dispatches through a small protocol:
CUDACore.AbstractBackend — Type
AbstractBackend
Abstract supertype for @cuda backend dispatch. The default backend is LLVMBackend, which compiles SIMT/PTX kernels via cufunction. Other backends (e.g. Tile IR via cuTile.jl) register a subtype and define methods for kernel_convert and kernel_compile; @cuda backend=... then routes through them.
@cuda backend=... accepts either an AbstractBackend instance or a module that defines DefaultBackend() returning one (e.g. @cuda backend=cuTile ... resolves to cuTile.DefaultBackend()).
CUDACore.LLVMBackend — Type
LLVMBackend()
Default @cuda backend. Compiles SIMT/PTX kernels via cufunction and converts arguments via cudaconvert.
CUDACore.DefaultBackend — Function
DefaultBackend()
Returns the default @cuda backend for this module (LLVMBackend). This makes @cuda backend=CUDA ... (or backend=CUDACore) resolve to LLVMBackend, mirroring the convention used by other backend packages (e.g. @cuda backend=cuTile ... resolves to cuTile.DefaultBackend()).
CUDACore.kernel_convert — Function
kernel_convert(backend, x)
Convert a host-side launch argument to its kernel-side form. The default implementation for LLVMBackend forwards to cudaconvert; other backends override to produce backend-specific argument types.
CUDACore.kernel_compile — Function
kernel_compile(backend, f, tt::Type{<:Tuple}; kwargs...) -> AbstractKernel
Compile a function for the given backend. Returns an AbstractKernel callable as kernel(args...; launch_kwargs...) to launch on the GPU. The default implementation for LLVMBackend is cufunction.
Reflection
If you want to inspect generated code, you can use macros that resemble functionality from the InteractiveUtils standard library:
@device_code_lowered
@device_code_typed
@device_code_warntype
@device_code_llvm
@device_code_ptx
@device_code_sass
@device_code
These macros are also available in function form:
CUDA.code_typed
CUDA.code_warntype
CUDA.code_llvm
CUDA.code_ptx
CUDA.code_sass
For more information, please consult the GPUCompiler.jl documentation. Only the code_sass functionality is actually defined in CUDA.jl:
CUDATools.@device_code_sass — Macro
@device_code_sass [io::IO=stdout, ...] ex
Evaluates the expression ex and prints the result of CUDATools.code_sass to io for every executed CUDA kernel. For other supported keywords, see CUDATools.code_sass.
CUDATools.code_sass — Function
code_sass([io], f, types; raw=false)
code_sass(f, [io]; raw=false)
Prints the SASS code corresponding to one or more CUDA modules to io, which defaults to stdout.
If providing both f and types, it is assumed that this uniquely identifies a kernel function, for which SASS code will be generated, and printed to io.
If only providing a callable function f, typically specified using the do syntax, the SASS code for all modules executed during evaluation of f will be printed. This can be convenient to display the SASS code for functions whose source code is not available.
- raw: dump the assembly like nvdisasm reports it, without post-processing;
- in the case of specifying f and types: all keyword arguments from cufunction
See also: @device_code_sass
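As a brief sketch of the macro form of reflection, the following prints and then captures the PTX of a kernel. It assumes @device_code_ptx accepts the same io keyword that the @device_code_sass signature above documents; the kernel `inc!` is hypothetical. The SASS variant follows the same pattern but additionally relies on the disassembly tooling being available, so it is not shown here.

```julia
using CUDA

# Hypothetical kernel: increments each element by one.
function inc!(x)
    i = threadIdx().x
    @inbounds x[i] += 1f0
    return nothing
end

x = CUDA.zeros(Float32, 32)

# Print the PTX of every kernel compiled while evaluating the expression.
@device_code_ptx @cuda threads=32 inc!(x)

# Capture the PTX in a buffer instead of printing to stdout.
buf = IOBuffer()
@device_code_ptx io=buf @cuda threads=32 inc!(x)
ptx = String(take!(buf))
```

Capturing into an IOBuffer makes it possible to search or diff the generated code programmatically.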