CUDA driver

This section lists the package's public functionality that directly corresponds to functionality of the CUDA driver API. In general, the abstractions stay close to those of the CUDA driver API, so for more information on certain library calls you can consult the CUDA driver API reference.

The documentation is grouped according to the modules of the driver API.

Error Handling

CUDA.CuErrorType
CuError(code)
CuError(code, details)

Create a CUDA error object with error code code. The optional details parameter indicates whether extra information, such as error logs, is known.

source
CUDA.nameMethod
name(err::CuError)

Gets the string representation of an error code.

julia> err = CuError(CUDA.cudaError_enum(1))
CuError(CUDA_ERROR_INVALID_VALUE)

julia> name(err)
"ERROR_INVALID_VALUE"
source

Version Management

CUDA.system_driver_versionMethod
system_driver_version()

Returns the latest version of CUDA supported by the original system driver, or nothing if the driver was not upgraded.

source
CUDA.set_runtime_version!Function
set_runtime_version!([version::VersionNumber]; local_toolkit=false)

Configures CUDA.jl to use a specific CUDA toolkit version from a specific source.

If local_toolkit is set, the CUDA toolkit will be used from the local system, otherwise it will be downloaded from an artifact source. In the case of a local toolkit, version informs CUDA.jl which version that is (this may be useful if auto-detection fails). In the case of artifact sources, version controls which version will be downloaded and used.
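
For example (the version number below is only an illustration):

CUDA.set_runtime_version!(v"12.2")             # download and use the CUDA 12.2 artifacts
CUDA.set_runtime_version!(local_toolkit=true)  # use the toolkit that is installed locally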

See also: CUDA.reset_runtime_version!.

source

Device Management

CUDA.current_deviceFunction
current_device()

Returns the current device.

Warning

This is a low-level API, returning the current device as known to the CUDA driver. For most users, it is recommended to use the device method instead.

source
CUDA.nameMethod
name(dev::CuDevice)

Returns an identifier string for the device.

source
CUDA.totalmemMethod
totalmem(dev::CuDevice)

Returns the total amount of memory (in bytes) on the device.

source
CUDA.attributeFunction
attribute(dev::CuDevice, code)

Returns information about the device.
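
For example, to query the maximum number of threads per block (a sketch; the attribute constant is assumed to be exposed as CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, following the driver's CUdevice_attribute enum):

dev = CuDevice(0)
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)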

source
attribute(X, pool::CuMemoryPool, attr)

Returns attribute attr about pool. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.

source
attribute(X, ptr::Union{Ptr,CuPtr}, attr)

Returns attribute attr about pointer ptr. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.

source

Certain common attributes are exposed by additional convenience functions:

CUDA.warpsizeMethod
warpsize(dev::CuDevice)

Returns the warp size (in threads) of the device.

source

Context Management

CUDA.CuContextType
CuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)
CuContext(f::Function, ...)

Create a CUDA context for device. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.

When you are done using the context, call CUDA.unsafe_destroy! to mark it for deletion, or use do-block syntax with this constructor.
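
A minimal sketch of the do-block form, which takes care of marking the context for deletion:

dev = CuDevice(0)
CuContext(dev) do ctx
    # the context is active here; it is marked for deletion when the block exits
end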

source
CUDA.unsafe_destroy!Method
unsafe_destroy!(ctx::CuContext)

Immediately destroy a context, freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.

source
CUDA.current_contextFunction
current_context()

Returns the current context.

Warning

This is a low-level API, returning the current context as known to the CUDA driver. For most users, it is recommended to use the context method instead.

source
CUDA.activateMethod
activate(ctx::CuContext)

Binds the specified CUDA context to the calling CPU thread.

source
CUDA.synchronizeMethod
synchronize(ctx::CuContext)

Block until all operations on ctx have completed. This is a heavyweight operation; typically you only need to call synchronize, which only synchronizes the stream associated with the current task.

source
CUDA.device_synchronizeFunction
device_synchronize()

Block until all operations on the current context have completed. This is a heavyweight operation; typically you only need to call synchronize, which only synchronizes the stream associated with the current task.

On the device, device_synchronize acts as a synchronization point for child grids in the context of dynamic parallelism.

source

Primary Context Management

CUDA.CuPrimaryContextType
CuPrimaryContext(dev::CuDevice)

Create a primary CUDA context for a given device.

Each primary context is unique per device and is shared with the CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.

source
CUDA.CuContextMethod
CuContext(pctx::CuPrimaryContext)

Retain the primary context on the GPU, returning a context compatible with the driver API. The primary context will be released when the returned driver context is finalized.

As these contexts are refcounted by CUDA, you should not call CUDA.unsafe_destroy! on them but use CUDA.unsafe_release! instead (available with do-block syntax as well).
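
A minimal sketch of retaining and releasing a primary context:

dev = CuDevice(0)
pctx = CuPrimaryContext(dev)
ctx = CuContext(pctx)       # retains the primary context
# ... use ctx ...
CUDA.unsafe_release!(ctx)   # lowers the refcount again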

source
CUDA.isactiveMethod
isactive(pctx::CuPrimaryContext)

Query whether a primary context is active.

source
CUDA.flagsMethod
flags(pctx::CuPrimaryContext)

Query the flags of a primary context.

source
CUDA.unsafe_reset!Method
unsafe_reset!(pctx::CuPrimaryContext)

Explicitly destroys and cleans up all resources associated with a device's primary context in the current process. Note that this forcibly invalidates all contexts derived from this primary context, and as a result outstanding resources might become invalid.

source
CUDA.unsafe_release!Method
CUDA.unsafe_release!(ctx::CuContext)

Lower the refcount of a context, possibly freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.

source

Module Management

CUDA.CuModuleType
CuModule(data, options::Dict{CUjit_option,Any})
CuModuleFile(path, options::Dict{CUjit_option,Any})

Create a CUDA module from data, or from a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.

options is an optional dictionary of JIT options and their respective values.

source

Function Management

CUDA.CuFunctionType
CuFunction(mod::CuModule, name::String)

Acquires a function handle from a named function in a module.
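
For example, loading a module from a PTX file and looking up one of its kernels (the file and kernel names are illustrative):

md = CuModuleFile("vadd.ptx")
vadd = CuFunction(md, "vadd")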

source

Global Variable Management

CUDA.CuGlobalType
CuGlobal{T}(mod::CuModule, name::String)

Acquires a typed global variable handle from a named global in a module.

source
Base.eltypeMethod
eltype(var::CuGlobal)

Return the element type of a global variable object.

source
Base.getindexMethod
Base.getindex(var::CuGlobal)

Return the current value of a global variable.

source
Base.setindex!Method
Base.setindex!(var::CuGlobal{T}, val::T)

Set the value of a global variable to val.
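
A minimal sketch, assuming the module md defines a global variable named "counter":

counter = CuGlobal{Int32}(md, "counter")
counter[] = Int32(0)   # setindex!: upload a value to the device
val = counter[]        # getindex: download the current value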

source

Linker

CUDA.add_data!Function
add_data!(link::CuLink, name::String, code::String)

Add PTX code to a pending link operation.

source
add_data!(link::CuLink, name::String, data::Vector{UInt8}, type::CUjitInputType)

Add object code to a pending link operation.

source
CUDA.add_file!Function
add_file!(link::CuLink, path::String, typ::CUjitInputType)

Add data from a file to a link operation. The argument typ indicates the type of the contained data.

source
CUDA.CuLinkImageType

The result of a linking operation.

This object keeps its parent linker object alive, as destroying a linker destroys linked images too.

source
CUDA.completeFunction
complete(link::CuLink)

Complete a pending linker invocation, returning an output image.

source
CUDA.CuModuleMethod
CuModule(img::CuLinkImage, ...)

Create a CUDA module from a completed linking operation. Options from CuModule apply.
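
A typical linker invocation might look like this sketch, assuming ptx_code contains PTX assembly as a String:

link = CuLink()
add_data!(link, "vadd", ptx_code)   # queue the PTX code for linking
img = complete(link)                # finish the link, yielding a CuLinkImage
md = CuModule(img)                  # load the linked image as a module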

source

Memory Management

Three kinds of memory buffers can be allocated: device memory, host memory, and unified memory. Each of these buffers can be allocated by calling alloc with the type of buffer as first argument, and freed by calling free. Certain buffers have specific methods defined.

CUDA.Mem.allocMethod
Mem.alloc(DeviceBuffer, bytesize::Integer;
          [async=false], [stream::CuStream], [pool::CuMemoryPool])

Allocate bytesize bytes of memory on the device. This memory is only accessible on the GPU, and requires explicit calls to unsafe_copyto!, which wraps cuMemcpy, for access on the CPU.
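
A minimal allocation sketch:

buf = Mem.alloc(Mem.DeviceBuffer, 1024)   # 1 KiB of device memory
# ... use the buffer, e.g. through a CuPtr obtained with convert ...
Mem.free(buf)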

source
CUDA.Mem.HostBufferType
Mem.HostBuffer
Mem.Host

A buffer of pinned memory on the CPU, possibly accessible on the GPU.

source
CUDA.Mem.allocMethod
Mem.alloc(HostBuffer, bytesize::Integer, [flags])

Allocate bytesize bytes of page-locked memory on the host. This memory is accessible from the CPU, and makes it possible to perform faster memory copies to the GPU. Furthermore, if flags is set to HOSTALLOC_DEVICEMAP the memory is also accessible from the GPU. These accesses are direct, and go through the PCI bus. If flags is set to HOSTALLOC_PORTABLE, the memory is considered mapped by all CUDA contexts, not just the one that created the memory, which is useful if the memory needs to be accessed from multiple devices. Multiple flags can be set at one time using a bitwise OR:

flags = HOSTALLOC_PORTABLE | HOSTALLOC_DEVICEMAP
source
CUDA.Mem.registerMethod
Mem.register(HostBuffer, ptr::Ptr, bytesize::Integer, [flags])

Page-lock the host memory pointed to by ptr. Subsequent transfers to and from devices will be faster, and can be executed asynchronously. If the HOSTREGISTER_DEVICEMAP flag is specified, the buffer will also be accessible directly from the GPU. These accesses are direct, and go through the PCI bus. If the HOSTREGISTER_PORTABLE flag is specified, any CUDA context can access the memory.
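
For example (a sketch; Mem.unregister is assumed to undo the registration):

arr = zeros(Float32, 1024)
buf = Mem.register(Mem.HostBuffer, pointer(arr), sizeof(arr))
# ... transfers involving arr are now faster and can be asynchronous ...
Mem.unregister(buf)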

source
CUDA.Mem.allocMethod
Mem.alloc(UnifiedBuffer, bytesize::Integer, [flags::CUmemAttach_flags])

Allocate bytesize bytes of unified memory. This memory is accessible from both the CPU and GPU, with the CUDA driver automatically copying upon first access.

source
CUDA.Mem.prefetchMethod
prefetch(::UnifiedBuffer, [bytes::Integer]; [device::CuDevice], [stream::CuStream])

Prefetches memory to the specified destination device.

source
CUDA.Mem.adviseMethod
advise(::UnifiedBuffer, advice::CUDA.CUmem_advise, [bytes::Integer]; [device::CuDevice])

Advise about the usage of a given memory range.

source

To work with these buffers, you need to convert them to a Ptr or CuPtr; several methods then work with these raw pointers.
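
For example, a device buffer can be converted to a typed device pointer:

buf = Mem.alloc(Mem.DeviceBuffer, 10*sizeof(Float32))
ptr = convert(CuPtr{Float32}, buf)   # raw device pointer, usable with unsafe_copyto! and friends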

Memory info

CUDA.memory_statusFunction
memory_status([io=stdout])

Report to io on the memory status of the current GPU and the active memory pool.

source
CUDA.available_memoryFunction
available_memory()

Returns the amount of memory (in bytes) that is available for allocation by the CUDA context.

source
CUDA.total_memoryFunction
total_memory()

Returns the total amount of memory (in bytes) available for allocation by the CUDA context.

source

Stream Management

CUDA.CuStreamType
CuStream(; flags=STREAM_DEFAULT, priority=nothing)

Create a CUDA stream.

source
CUDA.isdoneMethod
isdone(s::CuStream)

Return false if a stream is busy (has tasks running or queued) and true if that stream is free.

source
CUDA.priority_rangeFunction
priority_range()

Return the valid range of stream priorities as a StepRange (with step size 1). The lower bound of the range denotes the least priority (typically 0), with the upper bound representing the greatest possible priority (typically -1).
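
For example, to create a stream with the greatest possible priority (the upper bound of the range, per the description above):

r = priority_range()
s = CuStream(; priority=last(r))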

source
CUDA.synchronizeMethod
synchronize([stream::CuStream])

Wait until stream has finished executing, with stream defaulting to the stream associated with the current Julia task.

See also: device_synchronize

source
CUDA.@syncMacro
@sync [blocking=false] ex

Run expression ex and synchronize the GPU afterwards.

The blocking keyword argument determines how synchronization is performed. By default, non-blocking synchronization will be used, which gives other Julia tasks a chance to run while waiting for the GPU to finish. This may increase latency, so for short operations, or when benchmarking code that does not use multiple tasks, it may be beneficial to use blocking synchronization instead by setting blocking=true. Blocking synchronization can also be enabled globally by changing the nonblocking_synchronization preference.
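
A minimal sketch using CuArray operations, which execute asynchronously on the task-local stream:

a = CUDA.rand(1024)
b = CUDA.rand(1024)
CUDA.@sync begin
    c = a .+ b   # queued asynchronously; @sync waits for it to finish
end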

See also: synchronize.

source

For specific use cases, special streams are available:

CUDA.default_streamFunction
default_stream()

Return the default stream.

Note

It is generally better to use stream() to get a stream object that's local to the current task. That way, operations scheduled in other tasks can overlap.

source
CUDA.legacy_streamFunction
legacy_stream()

Return a special object to use an implicit stream with legacy synchronization behavior.

You can use this stream to perform operations that should block on all streams (with the exception of streams created with STREAM_NON_BLOCKING). This matches the old pre-CUDA 7 global stream behavior.

source
CUDA.per_thread_streamFunction
per_thread_stream()

Return a special object to use an implicit stream with per-thread synchronization behavior. This stream object is normally meant to be used with APIs that do not have per-thread variants of their entry points (i.e. without a ptsz or ptds suffix).

Note

It is generally not needed to use this type of stream. With CUDA.jl, each task already gets its own non-blocking stream, and multithreading in Julia is typically accomplished using tasks.

source

Event Management

CUDA.recordFunction
record(e::CuEvent, [stream::CuStream])

Record an event on a stream.

source
CUDA.isdoneMethod
isdone(e::CuEvent)

Return false if there is outstanding work preceding the most recent call to record(e) and true if all captured work has been completed.

source
CUDA.elapsedFunction
elapsed(start::CuEvent, stop::CuEvent)

Computes the elapsed time between two events (in seconds).
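
A minimal timing sketch using events:

start_ev = CuEvent()
stop_ev = CuEvent()
record(start_ev)
# ... queue GPU work on the current stream ...
record(stop_ev)
synchronize(stop_ev)
elapsed(start_ev, stop_ev)   # elapsed time in seconds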

source
CUDA.@elapsedMacro
@elapsed [blocking=false] ex

A macro to evaluate an expression, discarding the resulting value, instead returning the number of seconds it took to execute on the GPU, as a floating-point number.

See also: @sync.

source

Execution Control

CUDA.CuDim3Type
CuDim3(x)

CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))

A type used to specify dimensions, consisting of 3 integers for respectively the x, y and z dimension. Unspecified dimensions default to 1.

Often accepted as an argument through the CuDim type alias, e.g. in the case of cudacall or CUDA.launch, allowing dimensions to be passed as a plain integer or a tuple without having to construct an explicit CuDim3 object.

source
CUDA.cudacallFunction
cudacall(f, types, values...; blocks::CuDim, threads::CuDim,
         cooperative=false, shmem=0, stream=stream())

ccall-like interface for launching a CUDA function f on a GPU.

For example:

vadd = CuFunction(md, "vadd")
a = rand(Float32, 10)
b = rand(Float32, 10)
ad = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
unsafe_copyto!(ad, convert(Ptr{Cvoid}, a), 10*sizeof(Float32))
bd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
unsafe_copyto!(bd, convert(Ptr{Cvoid}, b), 10*sizeof(Float32))
c = zeros(Float32, 10)
cd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))

cudacall(vadd, (CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat}), ad, bd, cd; threads=10)
unsafe_copyto!(convert(Ptr{Cvoid}, c), cd, 10*sizeof(Float32))

The blocks and threads arguments control the launch configuration, and should both consist of either an integer, or a tuple of 1 to 3 integers (omitted dimensions default to 1). The types argument can contain both a tuple of types, and a tuple type, the latter being slightly faster.

source
CUDA.launchFunction
launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
       cooperative=false, shmem=0, stream=stream())

Low-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.

Arguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.

This is a low-level call, prefer to use cudacall instead.

source
launch(exec::CuGraphExec, [stream::CuStream])

Launches an executable graph, by default in the currently-active stream.

source

Profiler Control

CUDA.@profileMacro
@profile [io=stdout] [trace=false] [raw=false] code...
@profile external=true code...

Profile the GPU execution of code.

There are two modes of operation, depending on whether external is true or false.

Integrated profiler (external=false, the default)

In this mode, CUDA.jl will profile the execution of code and display the result. By default, a summary of host and device-side execution will be shown, including any NVTX events. To display a chronological trace of the captured activity instead, trace can be set to true. Trace output will include an ID column that can be used to match host-side and device-side activity. If raw is true, all data will always be included, even if it may not be relevant. The output will be written to io, which defaults to stdout.

Slow operations will be highlighted in the output: Entries colored in yellow are among the slowest 25%, while entries colored in red are among the slowest 5% of all operations.
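
For example, to get a chronological trace of a simple operation (a sketch):

a = CUDA.rand(1024, 1024)
CUDA.@profile trace=true sum(a)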

!!! compat "Julia 1.9" This functionality is only available on Julia 1.9 and later.

!!! compat "CUDA 11.2" Older versions of CUDA, before 11.2, contain bugs that may prevent the CUDA.@profile macro to work. It is recommended to use a newer runtime.

External profilers (external=true)

For more advanced profiling, it is possible to use an external profiling tool, such as NSight Systems or NSight Compute. When doing so, it is often advisable to only enable the profiler for the specific code region of interest. This can be done by wrapping the code with CUDA.@profile external=true, which used to be the only way to use this macro.

source
CUDA.Profile.startFunction
start()

Enables profile collection by the active profiling tool for the current context. If profiling is already enabled, then this call has no effect.

source
CUDA.Profile.stopFunction
stop()

Disables profile collection by the active profiling tool for the current context. If profiling is already disabled, then this call has no effect.

source

Texture Memory

Textures are represented by objects of type CuTexture which are bound to some underlying memory, either CuArrays or CuTextureArrays:

CUDA.CuTextureType
CuTexture{T,N,P}

N-dimensional texture object with elements of type T. These objects do not store data themselves, but are bound to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the CuDeviceTexture type.

Warning

Experimental API. Subject to change without deprecation.

source
CUDA.CuTextureMethod
CuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)

Construct an N-dimensional texture object with elements of type T as stored in parent.

Several keyword arguments alter the behavior of texture objects:

  • address_mode (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of N entries.
  • interpolation (nearest neighbour, linear, bilinear): how non-integral indices are fetched. Nearest-neighbour fetches a single value, others interpolate between multiple.
  • normalized_coordinates (true, false): whether indices are expected to fall in the normalized [0:1) range.

Warning

Experimental API. Subject to change without deprecation.

source
CuTexture(x::CuTextureArray{T,N})

Create an N-dimensional texture object with elements of type T that will be read from x.

Warning

Experimental API. Subject to change without deprecation.

source
CuTexture(x::CuArray{T,N})

Create an N-dimensional texture object that reads from a CuArray.

Note that the CuArray's memory needs to be well-aligned and properly strided (i.e. have a good pitch). Currently, this is not enforced.

Warning

Experimental API. Subject to change without deprecation.

source

You can create CuTextureArray objects from both host and device memory:

CUDA.CuTextureArrayType
CuTextureArray{T,N}(undef, dims)

N-dimensional dense texture array with elements of type T. These arrays are optimized for texture fetching, and are only meant to be used as a source for CuTexture{T,N,P} objects.

Warning

Experimental API. Subject to change without deprecation.

source
CUDA.CuTextureArrayMethod
CuTextureArray(A::AbstractArray)

Allocate and initialize a texture buffer from host memory in A.

Warning

Experimental API. Subject to change without deprecation.

source
CuTextureArray(A::CuArray)

Allocate and initialize a texture buffer from device memory in A.

Warning

Experimental API. Subject to change without deprecation.

source

Occupancy API

The occupancy API can be used to figure out an appropriate launch configuration for a compiled kernel (represented as a CuFunction) on the current device:

CUDA.launch_configurationFunction
launch_configuration(fun::CuFunction; shmem=0, max_threads=0)

Calculate a suggested launch configuration for kernel fun requiring shmem bytes of dynamic shared memory. Returns a tuple with a suggested number of threads, and the minimal number of blocks to reach maximal occupancy. Optionally, the maximum number of threads can be constrained using max_threads.

In the case of a variable amount of shared memory, pass a callable object for shmem instead, taking a single integer representing the block size and returning the amount of dynamic shared memory for that configuration.
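
A common pattern is to combine this with a kernel compiled using @cuda launch=false (a sketch; vadd_kernel and the arrays a, b and c are assumptions):

kernel = @cuda launch=false vadd_kernel(a, b, c)
config = launch_configuration(kernel.fun)
threads = min(length(c), config.threads)
blocks = cld(length(c), threads)
kernel(a, b, c; threads, blocks)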

source
CUDA.active_blocksFunction
active_blocks(fun::CuFunction, threads; shmem=0)

Calculate the maximum number of active blocks per multiprocessor when running threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.

source
CUDA.occupancyFunction
occupancy(fun::CuFunction, threads; shmem=0)

Calculate the theoretical occupancy of launching threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.

source

Graph Execution

CUDA graphs can be easily recorded and executed using the high-level @captured macro:

CUDA.@capturedMacro
for ...
    @captured begin
        # code that executes several kernels or CUDA operations
    end
end

A convenience macro for recording a graph of CUDA operations, and automatically caching and updating the resulting executable graph. This can improve performance when executing kernels in a loop, where the launch overhead might dominate the execution.

Warning

For this to be effective, the kernels and operations executed inside of the captured region should not significantly change across iterations of the loop. It is allowed to, e.g., change kernel arguments or inputs to operations, as this will be processed by updating the cached executable graph. However, significant changes will result in an instantiation of the graph from scratch, which is an expensive operation.

See also: capture.

source

Low-level operations are available too:

CUDA.CuGraphType
CuGraph([flags])

Create an empty graph for use with low-level graph operations. If you want to create a graph while directly recording operations, use capture. For a high-level interface that also automatically executes the graph, use the @captured macro.

source
CUDA.captureFunction
capture([flags], [throw_error::Bool=true]) do
    ...
end

Capture a graph of CUDA operations. The returned graph can then be instantiated and executed repeatedly for improved performance.

Note that many operations, like initial kernel compilation or memory allocations, cannot be captured. To work around this, you can set the throw_error keyword to false, which will cause this function to return nothing if such a failure happens. You can then try to evaluate the function in a regular way, and re-record afterwards.
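
A minimal sketch, warming up the broadcast kernel before capturing it:

A = CUDA.zeros(1024)
A .+= 1                 # run once outside of capture so the kernel is already compiled
graph = capture() do
    A .+= 1             # recorded, not executed
end
exec = instantiate(graph)
launch(exec)            # executes the captured operations
synchronize()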

See also: instantiate.

source
CUDA.instantiateFunction
instantiate(graph::CuGraph)

Creates an executable graph from a graph. This graph can then be launched, or updated with another graph.

See also: launch, update.

source
CUDA.launchMethod
launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
       cooperative=false, shmem=0, stream=stream())

Low-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.

Arguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.

This is a low-level call, prefer to use cudacall instead.

source
launch(exec::CuGraphExec, [stream::CuStream])

Launches an executable graph, by default in the currently-active stream.

source
CUDA.updateFunction
update(exec::CuGraphExec, graph::CuGraph; [throw_error::Bool=true])

Check whether an executable graph can be updated with a graph and perform the update if possible. Returns a boolean indicating whether the update was successful. Unless throw_error is set to false, also throws an error if the update failed.

source