CUDA driver

This section lists the package's public functionality that directly corresponds to functionality of the CUDA driver API. In general, the abstractions stay close to those of the CUDA driver API, so for more information on certain library calls you can consult the CUDA driver API reference.

The documentation is grouped according to the modules of the driver API.

Error Handling

CUDA.CuErrorType
CuError(code)
CuError(code, details)

Create a CUDA error object with error code code. The optional details parameter indicates whether extra information, such as error logs, is known.

source
CUDA.nameMethod
name(err::CuError)

Gets the string representation of an error code.

julia> err = CuError(CUDA.cudaError_enum(1))
CuError(CUDA_ERROR_INVALID_VALUE)

julia> name(err)
"ERROR_INVALID_VALUE"
source

Version Management

CUDA.system_driver_versionMethod
system_driver_version()

Returns the latest version of CUDA supported by the original system driver, or nothing if the driver was not upgraded.

source
CUDA.set_runtime_version!Function
set_runtime_version!([version::VersionNumber]; local_toolkit=false)

Configures CUDA.jl to use a specific CUDA toolkit version from a specific source.

If local_toolkit is set, the CUDA toolkit will be used from the local system, otherwise it will be downloaded from an artifact source. In the case of a local toolkit, version informs CUDA.jl which version that is (this may be useful if auto-detection fails). In the case of artifact sources, version controls which version will be downloaded and used.
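
For example (the version number below is only an illustration):

CUDA.set_runtime_version!(v"12.2")             # download and use the CUDA 12.2 artifacts
CUDA.set_runtime_version!(local_toolkit=true)  # use the toolkit that is installed locally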

See also: CUDA.reset_runtime_version!.

source

Device Management

CUDA.current_deviceFunction
current_device()

Returns the current device.

Warning

This is a low-level API, returning the current device as known to the CUDA driver. For most users, it is recommended to use the device method instead.

source
CUDA.nameMethod
name(dev::CuDevice)

Returns an identifier string for the device.

source
CUDA.totalmemMethod
totalmem(dev::CuDevice)

Returns the total amount of memory (in bytes) on the device.

source
CUDA.attributeFunction
attribute(dev::CuDevice, code)

Returns information about the device.
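
For example, to query the maximum number of threads per block (a sketch; the attribute constant is assumed to be exposed as CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, following the driver's CUdevice_attribute enum):

dev = CuDevice(0)
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)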

source
attribute(X, pool::CuMemoryPool, attr)

Returns attribute attr about pool. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.

source
attribute(X, ptr::Union{Ptr,CuPtr}, attr)

Returns attribute attr about pointer ptr. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.

source

Certain common attributes are exposed by additional convenience functions:

CUDA.warpsizeMethod
warpsize(dev::CuDevice)

Returns the warp size (in threads) of the device.

source

Context Management

CUDA.CuContextType
CuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)
CuContext(f::Function, ...)

Create a CUDA context for device. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.

When you are done using the context, call CUDA.unsafe_destroy! to mark it for deletion, or use do-block syntax with this constructor.
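
A minimal sketch of the do-block form, which takes care of marking the context for deletion:

dev = CuDevice(0)
CuContext(dev) do ctx
    # the context is active here; it is marked for deletion when the block exits
end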

source
CUDA.unsafe_destroy!Method
unsafe_destroy!(ctx::CuContext)

Immediately destroy a context, freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.

source
CUDA.current_contextFunction
current_context()

Returns the current context.

Warning

This is a low-level API, returning the current context as known to the CUDA driver. For most users, it is recommended to use the context method instead.

source
CUDA.activateMethod
activate(ctx::CuContext)

Binds the specified CUDA context to the calling CPU thread.

source
CUDA.synchronizeMethod
synchronize(ctx::CuContext)

Block until all operations on ctx have completed. This is a heavyweight operation; typically you only need to call synchronize, which only synchronizes the stream associated with the current task.

source
CUDA.device_synchronizeFunction
device_synchronize()

Block until all operations on the current context have completed. This is a heavyweight operation; typically you only need to call synchronize, which only synchronizes the stream associated with the current task.

On the device, device_synchronize acts as a synchronization point for child grids in the context of dynamic parallelism.

source

Primary Context Management

CUDA.CuPrimaryContextType
CuPrimaryContext(dev::CuDevice)

Create a primary CUDA context for a given device.

Each primary context is unique per device and is shared with the CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.

source
CUDA.CuContextMethod
CuContext(pctx::CuPrimaryContext)

Retain the primary context on the GPU, returning a context compatible with the driver API. The primary context will be released when the returned driver context is finalized.

As these contexts are refcounted by CUDA, you should not call CUDA.unsafe_destroy! on them but use CUDA.unsafe_release! instead (available with do-block syntax as well).
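
A minimal sketch of retaining and releasing a primary context:

dev = CuDevice(0)
pctx = CuPrimaryContext(dev)
ctx = CuContext(pctx)       # retains the primary context
# ... use ctx ...
CUDA.unsafe_release!(ctx)   # lowers the refcount again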

source
CUDA.isactiveMethod
isactive(pctx::CuPrimaryContext)

Query whether a primary context is active.

source
CUDA.flagsMethod
flags(pctx::CuPrimaryContext)

Query the flags of a primary context.

source
CUDA.unsafe_reset!Method
unsafe_reset!(pctx::CuPrimaryContext)

Explicitly destroys and cleans up all resources associated with a device's primary context in the current process. Note that this forcibly invalidates all contexts derived from this primary context, and as a result outstanding resources might become invalid.

source
CUDA.unsafe_release!Method
CUDA.unsafe_release!(ctx::CuContext)

Lower the refcount of a context, possibly freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.

source

Module Management

CUDA.CuModuleType
CuModule(data, options::Dict{CUjit_option,Any})
CuModuleFile(path, options::Dict{CUjit_option,Any})

Create a CUDA module from data, or from a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.

options is an optional dictionary of JIT options and their respective values.

source

Function Management

CUDA.CuFunctionType
CuFunction(mod::CuModule, name::String)

Acquires a function handle from a named function in a module.
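
For example, loading a module from a PTX file and looking up one of its kernels (the file and kernel names are illustrative):

md = CuModuleFile("vadd.ptx")
vadd = CuFunction(md, "vadd")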

source

Global Variable Management

CUDA.CuGlobalType
CuGlobal{T}(mod::CuModule, name::String)

Acquires a typed global variable handle from a named global in a module.

source
Base.eltypeMethod
eltype(var::CuGlobal)

Return the element type of a global variable object.

source
Base.getindexMethod
Base.getindex(var::CuGlobal)

Return the current value of a global variable.

source
Base.setindex!Method
Base.setindex!(var::CuGlobal{T}, val::T)

Set the value of a global variable to val.
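
A minimal sketch, assuming the module md defines a global variable named "counter":

counter = CuGlobal{Int32}(md, "counter")
counter[] = Int32(0)   # setindex!: upload a value to the device
val = counter[]        # getindex: download the current value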

source

Linker

CUDA.add_data!Function
add_data!(link::CuLink, name::String, code::String)

Add PTX code to a pending link operation.

source
add_data!(link::CuLink, name::String, data::Vector{UInt8}, type::CUjitInputType)

Add object code to a pending link operation.

source
CUDA.add_file!Function
add_file!(link::CuLink, path::String, typ::CUjitInputType)

Add data from a file to a link operation. The argument typ indicates the type of the contained data.

source
CUDA.CuLinkImageType

The result of a linking operation.

This object keeps its parent linker object alive, as destroying a linker destroys linked images too.

source
CUDA.completeFunction
complete(link::CuLink)

Complete a pending linker invocation, returning an output image.

source
CUDA.CuModuleMethod
CuModule(img::CuLinkImage, ...)

Create a CUDA module from a completed linking operation. Options from CuModule apply.
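
A typical linker invocation might look like this sketch, assuming ptx_code contains PTX assembly as a String:

link = CuLink()
add_data!(link, "vadd", ptx_code)   # queue the PTX code for linking
img = complete(link)                # finish the link, yielding a CuLinkImage
md = CuModule(img)                  # load the linked image as a module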

source

Memory Management

Three kinds of memory buffers can be allocated: device memory, host memory, and unified memory. Each of these buffers can be allocated by calling alloc with the type of buffer as first argument, and freed by calling free. Certain buffers have specific methods defined.

CUDA.Mem.allocMethod
Mem.alloc(DeviceBuffer, bytesize::Integer;
          [async=false], [stream::CuStream], [pool::CuMemoryPool])

Allocate bytesize bytes of memory on the device. This memory is only accessible on the GPU, and requires explicit calls to unsafe_copyto!, which wraps cuMemcpy, for access on the CPU.
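
A minimal allocation sketch:

buf = Mem.alloc(Mem.DeviceBuffer, 1024)   # 1 KiB of device memory
# ... use the buffer, e.g. through a CuPtr obtained with convert ...
Mem.free(buf)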

source
CUDA.Mem.HostBufferType
Mem.HostBuffer
Mem.Host

A buffer of pinned memory on the CPU, possibly accessible on the GPU.

source
CUDA.Mem.allocMethod
Mem.alloc(HostBuffer, bytesize::Integer, [flags])

Allocate bytesize bytes of page-locked memory on the host. This memory is accessible from the CPU, and makes it possible to perform faster memory copies to the GPU. Furthermore, if flags is set to HOSTALLOC_DEVICEMAP the memory is also accessible from the GPU. These accesses are direct, and go through the PCI bus. If flags is set to HOSTALLOC_PORTABLE, the memory is considered mapped by all CUDA contexts, not just the one that created the memory, which is useful if the memory needs to be accessed from multiple devices. Multiple flags can be set at one time using a bitwise OR:

flags = HOSTALLOC_PORTABLE | HOSTALLOC_DEVICEMAP
source
CUDA.Mem.registerMethod
Mem.register(HostBuffer, ptr::Ptr, bytesize::Integer, [flags])

Page-lock the host memory pointed to by ptr. Subsequent transfers to and from devices will be faster, and can be executed asynchronously. If the HOSTREGISTER_DEVICEMAP flag is specified, the buffer will also be accessible directly from the GPU. These accesses are direct, and go through the PCI bus. If the HOSTREGISTER_PORTABLE flag is specified, any CUDA context can access the memory.
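
For example (a sketch; Mem.unregister is assumed to undo the registration):

arr = zeros(Float32, 1024)
buf = Mem.register(Mem.HostBuffer, pointer(arr), sizeof(arr))
# ... transfers involving arr are now faster and can be asynchronous ...
Mem.unregister(buf)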

source
CUDA.Mem.allocMethod
Mem.alloc(UnifiedBuffer, bytesize::Integer, [flags::CUmemAttach_flags])

Allocate bytesize bytes of unified memory. This memory is accessible from both the CPU and GPU, with the CUDA driver automatically copying upon first access.

source
CUDA.Mem.prefetchMethod
prefetch(::UnifiedBuffer, [bytes::Integer]; [device::CuDevice], [stream::CuStream])

Prefetches memory to the specified destination device.

source
CUDA.Mem.adviseMethod
advise(::UnifiedBuffer, advice::CUDA.CUmem_advise, [bytes::Integer]; [device::CuDevice])

Advise about the usage of a given memory range.

source

To work with these buffers, you need to convert them to a Ptr or CuPtr; several methods then work with these raw pointers.
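
For example, a device buffer can be converted to a typed device pointer:

buf = Mem.alloc(Mem.DeviceBuffer, 10*sizeof(Float32))
ptr = convert(CuPtr{Float32}, buf)   # raw device pointer, usable with unsafe_copyto! and friends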

Memory info

CUDA.memory_statusFunction
memory_status([io=stdout])

Report to io on the memory status of the current GPU and the active memory pool.

source
CUDA.available_memoryFunction
available_memory()

Returns the amount of memory (in bytes) that is available for allocation by the CUDA context.

source
CUDA.total_memoryFunction
total_memory()

Returns the total amount of memory (in bytes) available for allocation by the CUDA context.

source

Stream Management

CUDA.CuStreamType
CuStream(; flags=STREAM_DEFAULT, priority=nothing)

Create a CUDA stream.

source
CUDA.isdoneMethod
isdone(s::CuStream)

Return false if a stream is busy (has tasks running or queued) and true if that stream is free.

source
CUDA.priority_rangeFunction
priority_range()

Return the valid range of stream priorities as a StepRange (with step size 1). The lower bound of the range denotes the least priority (typically 0), with the upper bound representing the greatest possible priority (typically -1).
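
For example, to create a stream with the greatest possible priority (the upper bound of the range, per the description above):

r = priority_range()
s = CuStream(; priority=last(r))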

source
CUDA.synchronizeMethod
synchronize([stream::CuStream])

Wait until stream has finished executing, with stream defaulting to the stream associated with the current Julia task.

See also: device_synchronize

source
CUDA.@syncMacro
@sync [blocking=false] ex

Run expression ex and synchronize the GPU afterwards.

The blocking keyword argument determines how synchronization is performed. By default, non-blocking synchronization will be used, which gives other Julia tasks a chance to run while waiting for the GPU to finish. This may increase latency, so for short operations, or when benchmarking code that does not use multiple tasks, it may be beneficial to use blocking synchronization instead by setting blocking=true. Blocking synchronization can also be enabled globally by changing the nonblocking_synchronization preference.
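
A minimal sketch using CuArray operations, which execute asynchronously on the task-local stream:

a = CUDA.rand(1024)
b = CUDA.rand(1024)
CUDA.@sync begin
    c = a .+ b   # queued asynchronously; @sync waits for it to finish
end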

See also: synchronize.

source

For specific use cases, special streams are available:

CUDA.default_streamFunction
default_stream()

Return the default stream.

Note

It is generally better to use stream() to get a stream object that's local to the current task. That way, operations scheduled in other tasks can overlap.

source
CUDA.legacy_streamFunction
legacy_stream()

Return a special object to use an implicit stream with legacy synchronization behavior.

You can use this stream to perform operations that should block on all streams (with the exception of streams created with STREAM_NON_BLOCKING). This matches the old pre-CUDA 7 global stream behavior.

source
CUDA.per_thread_streamFunction
per_thread_stream()

Return a special object to use an implicit stream with per-thread synchronization behavior. This stream object is normally meant to be used with APIs that do not have per-thread variants of their entry points (i.e. without a ptsz or ptds suffix).

Note

It is generally not needed to use this type of stream. With CUDA.jl, each task already gets its own non-blocking stream, and multithreading in Julia is typically accomplished using tasks.

source

Event Management

CUDA.recordFunction
record(e::CuEvent, [stream::CuStream])

Record an event on a stream.

source
CUDA.isdoneMethod
isdone(e::CuEvent)

Return false if there is outstanding work preceding the most recent call to record(e) and true if all captured work has been completed.

source
CUDA.elapsedFunction
elapsed(start::CuEvent, stop::CuEvent)

Computes the elapsed time between two events (in seconds).
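
A minimal timing sketch using events:

start_ev = CuEvent()
stop_ev = CuEvent()
record(start_ev)
# ... queue GPU work on the current stream ...
record(stop_ev)
synchronize(stop_ev)
elapsed(start_ev, stop_ev)   # elapsed time in seconds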

source
CUDA.@elapsedMacro
@elapsed [blocking=false] ex

A macro to evaluate an expression, discarding the resulting value, instead returning the number of seconds it took to execute on the GPU, as a floating-point number.

See also: @sync.

source

Execution Control

CUDA.CuDim3Type
CuDim3(x)

CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))

A type used to specify dimensions, consisting of 3 integers for respectively the x, y and z dimension. Unspecified dimensions default to 1.

Often accepted as an argument through the CuDim type alias, e.g. in the case of cudacall or CUDA.launch, allowing dimensions to be passed as a plain integer or a tuple without having to construct an explicit CuDim3 object.

source
CUDA.cudacallFunction
cudacall(f, types, values...; blocks::CuDim, threads::CuDim,
         cooperative=false, shmem=0, stream=stream())

ccall-like interface for launching a CUDA function f on a GPU.

For example:

vadd = CuFunction(md, "vadd")
a = rand(Float32, 10)
b = rand(Float32, 10)
ad = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
unsafe_copyto!(ad, convert(Ptr{Cvoid}, a), 10*sizeof(Float32))
bd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
unsafe_copyto!(bd, convert(Ptr{Cvoid}, b), 10*sizeof(Float32))
c = zeros(Float32, 10)
cd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))

cudacall(vadd, (CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat}), ad, bd, cd; threads=10)
unsafe_copyto!(convert(Ptr{Cvoid}, c), cd, 10*sizeof(Float32))

The blocks and threads arguments control the launch configuration, and should both consist of either an integer, or a tuple of 1 to 3 integers (omitted dimensions default to 1). The types argument can contain both a tuple of types, and a tuple type, the latter being slightly faster.

source
CUDA.launchFunction
launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
       cooperative=false, shmem=0, stream=stream())

Low-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.

Arguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.

This is a low-level call, prefer to use cudacall instead.

source
launch(exec::CuGraphExec, [stream::CuStream])

Launches an executable graph, by default in the currently-active stream.

source

Profiler Control

CUDA.@profileMacro
@profile [io=stdout] [trace=false] [raw=false] code...
@profile external=true code...

Profile the GPU execution of code.

There are two modes of operation, depending on whether external is true or false.

Integrated profiler (external=false, the default)

In this mode, CUDA.jl will profile the execution of code and display the result. By default, a summary of host and device-side execution will be shown, including any NVTX events. To display a chronological trace of the captured activity instead, trace can be set to true. Trace output will include an ID column that can be used to match host-side and device-side activity. If raw is true, all data will always be included, even if it may not be relevant. The output will be written to io, which defaults to stdout.

Slow operations will be highlighted in the output: Entries colored in yellow are among the slowest 25%, while entries colored in red are among the slowest 5% of all operations.
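
For example, to get a chronological trace of a simple operation (a sketch):

a = CUDA.rand(1024, 1024)
CUDA.@profile trace=true sum(a)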

!!! compat "Julia 1.9" This functionality is only available on Julia 1.9 and later.

!!! compat "CUDA 11.2" Older versions of CUDA, before 11.2, contain bugs that may prevent the CUDA.@profile macro to work. It is recommended to use a newer runtime.

External profilers (external=true)

For more advanced profiling, it is possible to use an external profiling tool, such as NSight Systems or NSight Compute. When doing so, it is often advisable to only enable the profiler for the specific code region of interest. This can be done by wrapping the code with CUDA.@profile external=true, which used to be the only way to use this macro.

source
CUDA.Profile.startFunction
start()

Enables profile collection by the active profiling tool for the current context. If profiling is already enabled, then this call has no effect.

source
CUDA.Profile.stopFunction
stop()

Disables profile collection by the active profiling tool for the current context. If profiling is already disabled, then this call has no effect.

source

Texture Memory

Textures are represented by objects of type CuTexture which are bound to some underlying memory, either CuArrays or CuTextureArrays:

CUDA.CuTextureType
CuTexture{T,N,P}

N-dimensional texture object with elements of type T. These objects do not store data themselves, but are bound to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the CuDeviceTexture type.

Warning

Experimental API. Subject to change without deprecation.

source
CUDA.CuTextureMethod
CuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)

Construct an N-dimensional texture object with elements of type T as stored in parent.

Several keyword arguments alter the behavior of texture objects:

  • address_mode (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of N entries.
  • interpolation (nearest neighbour, linear, bilinear): how non-integral indices are fetched. Nearest-neighbour fetches a single value, others interpolate between multiple.
  • normalized_coordinates (true, false): whether indices are expected to fall in the normalized [0:1) range.

Warning

Experimental API. Subject to change without deprecation.

source
CuTexture(x::CuTextureArray{T,N})

Create an N-dimensional texture object with elements of type T that will be read from x.

Warning

Experimental API. Subject to change without deprecation.

source
CuTexture(x::CuArray{T,N})

Create an N-dimensional texture object that reads from a CuArray.

Note that the CuArray's memory needs to be well-aligned and properly strided (i.e. have a good pitch). Currently, this is not enforced.

Warning

Experimental API. Subject to change without deprecation.

source

You can create CuTextureArray objects from both host and device memory:

CUDA.CuTextureArrayType
CuTextureArray{T,N}(undef, dims)

N-dimensional dense texture array with elements of type T. These arrays are optimized for texture fetching, and are only meant to be used as a source for CuTexture{T,N,P} objects.

Warning

Experimental API. Subject to change without deprecation.

source
CUDA.CuTextureArrayMethod
CuTextureArray(A::AbstractArray)

Allocate and initialize a texture buffer from host memory in A.

Warning

Experimental API. Subject to change without deprecation.

source
CuTextureArray(A::CuArray)

Allocate and initialize a texture buffer from device memory in A.

Warning

Experimental API. Subject to change without deprecation.

source

Occupancy API

The occupancy API can be used to figure out an appropriate launch configuration for a compiled kernel (represented as a CuFunction) on the current device:

CUDA.launch_configurationFunction
launch_configuration(fun::CuFunction; shmem=0, max_threads=0)

Calculate a suggested launch configuration for kernel fun requiring shmem bytes of dynamic shared memory. Returns a tuple with a suggested number of threads, and the minimal number of blocks to reach maximal occupancy. Optionally, the maximum number of threads can be constrained using max_threads.

In the case of a variable amount of shared memory, pass a callable object for shmem instead, taking a single integer representing the block size and returning the amount of dynamic shared memory for that configuration.
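
A common pattern is to combine this with a kernel compiled using @cuda launch=false (a sketch; vadd_kernel and the arrays a, b and c are assumptions):

kernel = @cuda launch=false vadd_kernel(a, b, c)
config = launch_configuration(kernel.fun)
threads = min(length(c), config.threads)
blocks = cld(length(c), threads)
kernel(a, b, c; threads, blocks)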

source
CUDA.active_blocksFunction
active_blocks(fun::CuFunction, threads; shmem=0)

Calculate the maximum number of active blocks per multiprocessor when running threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.

source
CUDA.occupancyFunction
occupancy(fun::CuFunction, threads; shmem=0)

Calculate the theoretical occupancy of launching threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.

source

Graph Execution

CUDA graphs can be easily recorded and executed using the high-level @captured macro:

CUDA.@capturedMacro
for ...
    @captured begin
        # code that executes several kernels or CUDA operations
    end
end

A convenience macro for recording a graph of CUDA operations, and automatically caching and updating the resulting executable graph. This can improve performance when executing kernels in a loop, where the launch overhead might dominate the execution.

Warning

For this to be effective, the kernels and operations executed inside of the captured region should not significantly change across iterations of the loop. It is allowed to, e.g., change kernel arguments or inputs to operations, as this will be processed by updating the cached executable graph. However, significant changes will result in an instantiation of the graph from scratch, which is an expensive operation.

See also: capture.

source

Low-level operations are available too:

CUDA.CuGraphType
CuGraph([flags])

Create an empty graph for use with low-level graph operations. If you want to create a graph while directly recording operations, use capture. For a high-level interface that also automatically executes the graph, use the @captured macro.

source
CUDA.captureFunction
capture([flags], [throw_error::Bool=true]) do
    ...
end

Capture a graph of CUDA operations. The returned graph can then be instantiated and executed repeatedly for improved performance.

Note that many operations, like initial kernel compilation or memory allocations, cannot be captured. To work around this, you can set the throw_error keyword to false, which will cause this function to return nothing if such a failure happens. You can then try to evaluate the function in a regular way, and re-record afterwards.
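
A minimal sketch, warming up the broadcast kernel before capturing it:

A = CUDA.zeros(1024)
A .+= 1                 # run once outside of capture so the kernel is already compiled
graph = capture() do
    A .+= 1             # recorded, not executed
end
exec = instantiate(graph)
launch(exec)            # executes the captured operations
synchronize()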

See also: instantiate.

source
CUDA.instantiateFunction
instantiate(graph::CuGraph)

Creates an executable graph from a graph. This graph can then be launched, or updated with another graph.

See also: launch, update.

source
CUDA.launchMethod
launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
       cooperative=false, shmem=0, stream=stream())

Low-level call to launch a CUDA function f on the GPU, using blocks and threads as respectively the grid and block configuration. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.

Arguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.

This is a low-level call, prefer to use cudacall instead.

source
launch(exec::CuGraphExec, [stream::CuStream])

Launches an executable graph, by default in the currently-active stream.

source
CUDA.updateFunction
update(exec::CuGraphExec, graph::CuGraph; [throw_error::Bool=true])

Check whether an executable graph can be updated with a graph and perform the update if possible. Returns a boolean indicating whether the update was successful. Unless throw_error is set to false, also throws an error if the update failed.

source