CUDA driver
This section lists the package's public functionality that directly corresponds to functionality of the CUDA driver API. In general, the abstractions stay close to those of the CUDA driver API, so for more information on certain library calls you can consult the CUDA driver API reference.
The documentation is grouped according to the modules of the driver API.
Error Handling
CUDA.CuError — Type
CuError(code)
Create a CUDA error object with error code code.
CUDA.name — Method
name(err::CuError)
Gets the string representation of an error code.
julia> err = CuError(CUDA.cudaError_enum(1))
CuError(CUDA_ERROR_INVALID_VALUE)
julia> name(err)
"ERROR_INVALID_VALUE"
CUDA.description — Method
description(err::CuError)
Gets the string description of an error code.
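For example, mirroring the name example above (the exact message string comes from the driver, so it may differ):
julia> err = CuError(CUDA.cudaError_enum(1))
CuError(CUDA_ERROR_INVALID_VALUE)

julia> description(err)
"invalid argument"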
Version Management
CUDA.driver_version — Method
driver_version()
Returns the latest version of CUDA supported by the loaded driver.
CUDA.runtime_version — Method
runtime_version()
Returns the CUDA Runtime version.
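For example (illustrative output; both functions return a VersionNumber):
julia> driver_version()
v"12.2.0"

julia> runtime_version()
v"12.2.0"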
CUDA.set_runtime_version! — Function
CUDA.set_runtime_version!([version::VersionNumber]; [local_toolkit::Bool])
Configures the active project to use a specific CUDA toolkit version from a specific source.
If local_toolkit is set, the CUDA toolkit will be used from the local system, otherwise it will be downloaded from an artifact source. In the case of a local toolkit, version informs CUDA.jl which version that is (this may be useful if auto-detection fails). In the case of artifact sources, version controls which version will be downloaded and used.
When not specifying either the version or the local_toolkit argument, the default behavior will be used, which is to use the most recent compatible runtime available from an artifact source. Note that this will override any Preferences that may be configured in a higher-up depot; to clear preferences nondestructively, use CUDA.reset_runtime_version! instead.
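As a sketch, with an illustrative version number:
CUDA.set_runtime_version!(v"12.2")               # download and use the 12.2 artifact
CUDA.set_runtime_version!(local_toolkit=true)    # use the toolkit installed on the local system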
CUDA.reset_runtime_version! — Function
CUDA.reset_runtime_version!()
Resets the CUDA version preferences in the active project to the default, which is to use the most recent compatible runtime available from an artifact source, unless a higher-up depot has configured a different preference. To force use of the default behavior for the local project, use CUDA.set_runtime_version! with no arguments.
Device Management
CUDA.CuDevice — Type
CuDevice(ordinal::Integer)
Get a handle to a compute device.
CUDA.devices — Function
devices()
Get an iterator for the compute devices.
CUDA.current_device — Function
current_device()
Returns the current device.
This is a low-level API, returning the current device as known to the CUDA driver. For most users, it is recommended to use the device method instead.
CUDA.name — Method
name(dev::CuDevice)
Returns an identifier string for the device.
CUDA.totalmem — Method
totalmem(dev::CuDevice)
Returns the total amount of memory (in bytes) on the device.
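For example, combining the calls above to enumerate all devices (output varies per system):
for dev in devices()
    println(name(dev), ": ", totalmem(dev) ÷ 2^20, " MiB")
end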
CUDA.attribute — Function
attribute(dev::CuDevice, code)
Returns information about the device.
attribute(X, pool::CuMemoryPool, attr)
Returns attribute attr about pool. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.
attribute(X, ptr::Union{Ptr,CuPtr}, attr)
Returns attribute attr about pointer ptr. The type of the returned value depends on the attribute, and as such must be passed as the X parameter.
Certain common attributes are exposed by additional convenience functions:
CUDA.capability — Method
capability(dev::CuDevice)
Returns the compute capability of the device.
CUDA.warpsize — Method
warpsize(dev::CuDevice)
Returns the warp size (in threads) of the device.
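A minimal sketch of these convenience queries (the returned values depend on the hardware):
dev = CuDevice(0)
capability(dev)   # e.g. v"8.6"
warpsize(dev)     # 32 on current hardware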
Context Management
CUDA.CuContext — Type
CuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)
CuContext(f::Function, ...)
Create a CUDA context for the given device. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.
When you are done using the context, call CUDA.unsafe_destroy! to mark it for deletion, or use do-block syntax with this constructor.
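A sketch of the do-block form, which marks the context for deletion when the block exits:
CuContext(CuDevice(0)) do ctx
    # use ctx: load modules, allocate memory, launch kernels ...
end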
CUDA.unsafe_destroy! — Method
unsafe_destroy!(ctx::CuContext)
Immediately destroy a context, freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.
CUDA.current_context — Function
current_context()
Returns the current context. Throws an undefined reference error if the current thread has no context bound to it, or if the bound context has been destroyed.
This is a low-level API, returning the current context as known to the CUDA driver. For most users, it is recommended to use the context method instead.
CUDA.activate — Method
activate(ctx::CuContext)
Binds the specified CUDA context to the calling CPU thread.
CUDA.synchronize — Method
synchronize(ctx::CuContext)
Block until all operations on ctx have completed. This is a heavyweight operation; typically you only need to call synchronize, which only synchronizes the stream associated with the current task.
CUDA.device_synchronize — Function
device_synchronize()
Block until all operations on the current context have completed. This is a heavyweight operation; typically you only need to call synchronize, which only synchronizes the stream associated with the current task.
On the device, device_synchronize acts as a synchronization point for child grids in the context of dynamic parallelism.
Primary Context Management
CUDA.CuPrimaryContext — Type
CuPrimaryContext(dev::CuDevice)
Create a primary CUDA context for a given device.
Each primary context is unique per device and is shared with the CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.
CUDA.CuContext — Method
CuContext(pctx::CuPrimaryContext)
Derive a context from a primary context.
Calling this function increases the reference count of the primary context. The returned context should not be freed with the unsafe_destroy! function that's used with ordinary contexts. Instead, the refcount of the primary context should be decreased by calling unsafe_release!, or set to zero by calling unsafe_reset!. The easiest way to do this is by using do-block syntax.
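A sketch of the do-block form mentioned above, which releases the primary context when the block exits:
pctx = CuPrimaryContext(CuDevice(0))
CuContext(pctx) do ctx
    # use ctx; this context is shared with runtime-API based libraries
end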
CUDA.isactive — Method
isactive(pctx::CuPrimaryContext)
Query whether a primary context is active.
CUDA.flags — Method
flags(pctx::CuPrimaryContext)
Query the flags of a primary context.
CUDA.setflags! — Method
setflags!(pctx::CuPrimaryContext)
Set the flags of a primary context.
CUDA.unsafe_reset! — Method
unsafe_reset!(pctx::CuPrimaryContext)
Explicitly destroys and cleans up all resources associated with a device's primary context in the current process. Note that this forcibly invalidates all contexts derived from this primary context, and as a result outstanding resources might become invalid.
CUDA.unsafe_release! — Method
CUDA.unsafe_release!(pctx::CuPrimaryContext)
Lower the refcount of a context, possibly freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.
Module Management
CUDA.CuModule — Type
CuModule(data, options::Dict{CUjit_option,Any})
CuModuleFile(path, options::Dict{CUjit_option,Any})
Create a CUDA module from data, or from a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.
The options argument is an optional dictionary of JIT options and their respective values.
Function Management
CUDA.CuFunction — Type
CuFunction(mod::CuModule, name::String)
Acquires a function handle from a named function in a module.
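Putting module and function management together (the PTX file and kernel name are hypothetical):
md = CuModuleFile("vadd.ptx")      # file containing PTX code
vadd = CuFunction(md, "vadd")      # look up the kernel by name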
Global Variable Management
CUDA.CuGlobal — Type
CuGlobal{T}(mod::CuModule, name::String)
Acquires a typed global variable handle from a named global in a module.
Base.eltype — Method
eltype(var::CuGlobal)
Return the element type of a global variable object.
Base.getindex — Method
Base.getindex(var::CuGlobal)
Return the current value of a global variable.
Base.setindex! — Method
Base.setindex!(var::CuGlobal{T}, val::T)
Set the value of a global variable to val.
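A sketch, assuming a module md that defines an Int32 global named counter:
counter = CuGlobal{Int32}(md, "counter")
counter[] = Int32(0)    # setindex!: write the value
val = counter[]         # getindex: read it back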
Linker
CUDA.CuLink — Type
CuLink()
Creates a pending JIT linker invocation.
CUDA.add_data! — Function
add_data!(link::CuLink, name::String, code::String)
Add PTX code to a pending link operation.
add_data!(link::CuLink, name::String, data::Vector{UInt8})
Add object code to a pending link operation.
CUDA.add_file! — Function
add_file!(link::CuLink, path::String, typ::CUjitInputType)
Add data from a file to a link operation. The argument typ indicates the type of the contained data.
CUDA.CuLinkImage — Type
The result of a linking operation.
This object keeps its parent linker object alive, as destroying a linker destroys linked images too.
CUDA.complete — Function
complete(link::CuLink)
Complete a pending linker invocation, returning an output image.
CUDA.CuModule — Method
CuModule(img::CuLinkImage, ...)
Create a CUDA module from a completed linking operation. Options from CuModule apply.
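A sketch of a complete link sequence, assuming ptx_code holds PTX source as a String:
link = CuLink()
add_data!(link, "mykernel", ptx_code)
img = complete(link)
md = CuModule(img)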
Memory Management
Different kinds of memory objects can be created, representing the different kinds of memory that the CUDA toolkit supports. Each of these memory objects can be allocated by calling alloc with the type of memory as first argument, and freed by calling free. Certain kinds of memory have specific methods defined.
Device memory
This memory is accessible only by the GPU, and is the most common kind of memory used in CUDA programming.
CUDA.DeviceMemory — Type
DeviceMemory
Device memory residing on the GPU.
CUDA.alloc — Method
alloc(DeviceMemory, bytesize::Integer;
      [async=false], [stream::CuStream], [pool::CuMemoryPool])
Allocate bytesize bytes of memory on the device. This memory is only accessible on the GPU, and requires explicit calls to unsafe_copyto!, which wraps cuMemcpy, for access on the CPU.
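A minimal sketch of the allocation lifecycle (see the Pointers section below for conversions):
mem = alloc(CUDA.DeviceMemory, 1024)    # 1024 bytes of device memory
ptr = convert(CuPtr{UInt8}, mem)        # raw pointer, e.g. for unsafe_copyto!
free(mem)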
Unified memory
Unified memory is accessible by both the CPU and the GPU, and is managed by the CUDA runtime. It is automatically migrated between the CPU and the GPU as needed, which simplifies programming but can lead to performance issues if not used carefully.
CUDA.UnifiedMemory — Type
UnifiedMemory
Unified memory that is accessible on both the CPU and GPU.
CUDA.alloc — Method
alloc(UnifiedMemory, bytesize::Integer, [flags::CUmemAttach_flags])
Allocate bytesize bytes of unified memory. This memory is accessible from both the CPU and GPU, with the CUDA driver automatically copying upon first access.
CUDA.prefetch — Method
prefetch(::UnifiedMemory, [bytes::Integer]; [device::CuDevice], [stream::CuStream])
Prefetches memory to the specified destination device.
CUDA.advise — Method
advise(::UnifiedMemory, advice::CUDA.CUmem_advise, [bytes::Integer]; [device::CuDevice])
Advise about the usage of a given memory range.
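A sketch of allocating unified memory and prefetching it to the current device, following the signatures above:
mem = alloc(CUDA.UnifiedMemory, 1024)
prefetch(mem; device=current_device())   # migrate ahead of first GPU access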
Host memory
Host memory resides on the CPU, but is accessible by the GPU via the PCI bus. This is the slowest kind of memory, but is useful for communicating between running kernels and the host (e.g., to update counters or flags).
CUDA.HostMemory — Type
HostMemory
Pinned memory residing on the CPU, possibly accessible on the GPU.
CUDA.alloc — Method
alloc(HostMemory, bytesize::Integer, [flags])
Allocate bytesize bytes of page-locked memory on the host. This memory is accessible from the CPU, and makes it possible to perform faster memory copies to the GPU. Furthermore, if flags is set to MEMHOSTALLOC_DEVICEMAP the memory is also accessible from the GPU. These accesses are direct, and go through the PCI bus. If flags is set to MEMHOSTALLOC_PORTABLE, the memory is considered mapped by all CUDA contexts, not just the one that created the memory, which is useful if the memory needs to be accessed from multiple devices. Multiple flags can be set at one time using a bitwise OR:
flags = MEMHOSTALLOC_PORTABLE | MEMHOSTALLOC_DEVICEMAP
CUDA.register — Method
register(HostMemory, ptr::Ptr, bytesize::Integer, [flags])
Page-lock the host memory pointed to by ptr. Subsequent transfers to and from devices will be faster, and can be executed asynchronously. If the MEMHOSTREGISTER_DEVICEMAP flag is specified, the buffer will also be accessible directly from the GPU. These accesses are direct, and go through the PCI bus. If the MEMHOSTREGISTER_PORTABLE flag is specified, any CUDA context can access the memory.
CUDA.unregister — Method
unregister(::HostMemory)
Unregisters a memory range that was registered with register.
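A sketch of pinning an existing host array to speed up transfers (GC.@preserve keeps the array alive while its pointer is in use):
data = zeros(Float32, 1024)
GC.@preserve data begin
    mem = register(CUDA.HostMemory, pointer(data), sizeof(data))
    # ... perform fast, asynchronous copies to and from the GPU ...
    unregister(mem)
end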
Array memory
Array memory is a special kind of memory that is optimized for 2D and 3D access patterns. The memory is opaquely managed by the CUDA runtime, and is typically only used in combination with texture intrinsics.
CUDA.ArrayMemory — Type
ArrayMemory
Array memory residing on the GPU, possibly in a specially-formatted way.
CUDA.alloc — Method
alloc(ArrayMemory, dims::Dims)
Allocate array memory with dimensions dims. The memory is accessible on the GPU, but can only be used in conjunction with special intrinsics (e.g., texture intrinsics).
Pointers
To work with these buffers, you need to convert them to a Ptr, CuPtr, or in the case of ArrayMemory a CuArrayPtr. You can then use common Julia methods on these pointers, such as unsafe_copyto!. CUDA.jl also provides some specialized functionality that does not match standard Julia functionality:
CUDA.unsafe_copy2d! — Function
unsafe_copy2d!(dst, dstTyp, src, srcTyp, width, height=1;
               dstPos=(1,1), dstPitch=0,
               srcPos=(1,1), srcPitch=0,
               async=false, stream=nothing)
Perform a 2D memory copy between pointers src and dst, starting at positions srcPos and dstPos respectively (1-indexed). Pitch can be specified for both the source and destination; consult the CUDA documentation for more details. This call is executed asynchronously if async is set, otherwise stream is synchronized.
CUDA.unsafe_copy3d! — Function
unsafe_copy3d!(dst, dstTyp, src, srcTyp, width, height=1, depth=1;
               dstPos=(1,1,1), dstPitch=0, dstHeight=0,
               srcPos=(1,1,1), srcPitch=0, srcHeight=0,
               async=false, stream=nothing)
Perform a 3D memory copy between pointers src and dst, starting at positions srcPos and dstPos respectively (1-indexed). Both pitch and height can be specified for both the source and destination; consult the CUDA documentation for more details. This call is executed asynchronously if async is set, otherwise stream is synchronized.
CUDA.memset — Function
memset(mem::CuPtr, value::Union{UInt8,UInt16,UInt32}, len::Integer; [stream::CuStream])
Initialize device memory by copying value len times.
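For example, zero-initializing a fresh device allocation (a sketch):
mem = alloc(CUDA.DeviceMemory, 1024)
memset(convert(CuPtr{UInt8}, mem), 0x00, 1024)   # fill 1024 bytes with zeros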
Other
CUDA.free_memory — Function
free_memory()
Returns the amount of free memory (in bytes) available for allocation by the CUDA context.
CUDA.total_memory — Function
total_memory()
Returns the total amount of memory (in bytes) available for allocation by the CUDA context.
Stream Management
CUDA.CuStream — Type
CuStream(; flags=STREAM_DEFAULT, priority=nothing)
Create a CUDA stream.
CUDA.isdone — Method
isdone(s::CuStream)
Return false if the stream is busy (has tasks running or queued) and true if the stream is free.
CUDA.priority_range — Function
priority_range()
Return the valid range of stream priorities as a StepRange (with step size 1). The lower bound of the range denotes the least priority (typically 0), with the upper bound representing the greatest possible priority (typically -1).
CUDA.priority — Function
priority(s::CuStream)
Return the priority of a stream s.
CUDA.synchronize — Method
synchronize([stream::CuStream])
Wait until stream has finished executing, with stream defaulting to the stream associated with the current Julia task.
See also: device_synchronize
CUDA.@sync — Macro
@sync [blocking=false] ex
Run expression ex and synchronize the GPU afterwards.
The blocking keyword argument determines how synchronization is performed. By default, non-blocking synchronization will be used, which gives other Julia tasks a chance to run while waiting for the GPU to finish. This may increase latency, so for short operations, or when benchmarking code that does not use multiple tasks, it may be beneficial to use blocking synchronization instead by setting blocking=true. Blocking synchronization can also be enabled globally by changing the nonblocking_synchronization preference.
See also: synchronize.
For specific use cases, special streams are available:
CUDA.default_stream — Function
default_stream()
Return the default stream.
It is generally better to use stream() to get a stream object that's local to the current task. That way, operations scheduled in other tasks can overlap.
CUDA.legacy_stream — Function
legacy_stream()
Return a special object to use an implicit stream with legacy synchronization behavior.
You can use this stream to perform operations that should block on all streams (with the exception of streams created with STREAM_NON_BLOCKING). This matches the old pre-CUDA 7 global stream behavior.
CUDA.per_thread_stream — Function
per_thread_stream()
Return a special object to use an implicit stream with per-thread synchronization behavior. This stream object is normally meant to be used with APIs that do not have per-thread versions (i.e., entry points without a ptsz or ptds suffix).
It is generally not needed to use this type of stream. With CUDA.jl, each task already gets its own non-blocking stream, and multithreading in Julia is typically accomplished using tasks.
Event Management
CUDA.CuEvent — Type
CuEvent()
Create a new CUDA event.
CUDA.record — Function
record(e::CuEvent, [stream::CuStream])
Record an event on a stream.
CUDA.synchronize — Method
synchronize(e::CuEvent)
Waits for an event to complete.
CUDA.isdone — Method
isdone(e::CuEvent)
Return false if there is outstanding work preceding the most recent call to record(e), and true if all captured work has been completed.
CUDA.wait — Method
wait(e::CuEvent, [stream::CuStream])
Make a stream wait on an event. This only makes the stream wait, and not the host; use synchronize(::CuEvent) for that.
CUDA.elapsed — Function
elapsed(start::CuEvent, stop::CuEvent)
Computes the elapsed time between two events (in seconds).
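A sketch of timing GPU work with events:
start, stop = CuEvent(), CuEvent()
record(start)
# ... launch GPU work ...
record(stop)
synchronize(stop)
t = elapsed(start, stop)   # elapsed time in seconds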
CUDA.@elapsed — Macro
@elapsed [blocking=false] ex
A macro to evaluate an expression, discarding the resulting value, instead returning the number of seconds it took to execute on the GPU, as a floating-point number.
See also: @sync.
Execution Control
CUDA.CuDim3 — Type
CuDim3(x)
CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))
A type used to specify dimensions, consisting of 3 integers for the x, y and z dimensions respectively. Unspecified dimensions default to 1.
Often accepted as argument through the CuDim type alias, e.g. in the case of cudacall or CUDA.launch, allowing dimensions to be passed as a plain integer or a tuple without having to construct an explicit CuDim3 object.
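For example, all of the following describe valid launch dimensions:
CuDim3(256)          # 256×1×1
CuDim3((16, 16))     # 16×16×1
CuDim3((8, 8, 4))    # 8×8×4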
CUDA.cudacall — Function
cudacall(f, types, values...; blocks::CuDim, threads::CuDim,
         cooperative=false, shmem=0, stream=stream())
ccall-like interface for launching a CUDA function f on a GPU.
For example:
vadd = CuFunction(md, "vadd")
a = rand(Float32, 10)
b = rand(Float32, 10)
ad = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))
unsafe_copyto!(ad, convert(Ptr{Cvoid}, a), 10*sizeof(Float32))
bd = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))
unsafe_copyto!(bd, convert(Ptr{Cvoid}, b), 10*sizeof(Float32))
c = zeros(Float32, 10)
cd = alloc(CUDA.DeviceMemory, 10*sizeof(Float32))
cudacall(vadd, (CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat}), ad, bd, cd; threads=10)
unsafe_copyto!(convert(Ptr{Cvoid}, c), cd, 10*sizeof(Float32))
The blocks and threads arguments control the launch configuration, and should both consist of either an integer, or a tuple of 1 to 3 integers (omitted dimensions default to 1). The types argument can contain both a tuple of types, and a tuple type, the latter being slightly faster.
CUDA.launch — Function
launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
       cooperative=false, shmem=0, stream=stream())
Low-level call to launch a CUDA function f on the GPU, using blocks and threads as the grid and block configuration respectively. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.
Arguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.
This is a low-level call, prefer to use cudacall instead.
launch(exec::CuGraphExec, [stream::CuStream])
Launches an executable graph, by default in the currently-active stream.
Profiler Control
CUDA.@profile — Macro
@profile [trace=false] [raw=false] code...
@profile external=true code...
Profile the GPU execution of code.
There are two modes of operation, depending on whether external is true or false. The default value depends on whether Julia is being run under an external profiler.
Integrated profiler (external=false, the default)
In this mode, CUDA.jl will profile the execution of code and display the result. By default, a summary of host and device-side execution will be shown, including any NVTX events. To display a chronological trace of the captured activity instead, trace can be set to true. Trace output will include an ID column that can be used to match host-side and device-side activity. If raw is true, all data will always be included, even if it may not be relevant. The output will be written to io, which defaults to stdout.
Slow operations will be highlighted in the output: Entries colored in yellow are among the slowest 25%, while entries colored in red are among the slowest 5% of all operations.
!!! compat "Julia 1.9" This functionality is only available on Julia 1.9 and later.
!!! compat "CUDA 11.2" Older versions of CUDA, before 11.2, contain bugs that may prevent the CUDA.@profile macro from working. It is recommended to use a newer runtime.
External profilers (external=true, when an external profiler is detected)
For more advanced profiling, it is possible to use an external profiling tool, such as NSight Systems or NSight Compute. When doing so, it is often advisable to only enable the profiler for the specific code region of interest. This can be done by wrapping the code with CUDA.@profile external=true, which used to be the only way to use this macro.
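A sketch of both modes (the profiled operations are placeholders):
CUDA.@profile trace=true begin
    # GPU operations, traced by the integrated profiler
end
CUDA.@profile external=true begin
    # region of interest, when running under NSight Systems or NSight Compute
end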
CUDA.Profile.start — Function
start()
Enables profile collection by the active profiling tool for the current context. If profiling is already enabled, then this call has no effect.
CUDA.Profile.stop — Function
stop()
Disables profile collection by the active profiling tool for the current context. If profiling is already disabled, then this call has no effect.
Texture Memory
Textures are represented by objects of type CuTexture which are bound to some underlying memory, either CuArrays or CuTextureArrays:
CUDA.CuTexture — Type
CuTexture{T,N,P}
N-dimensional texture object with elements of type T. These objects do not store data themselves, but are bound to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the CuDeviceTexture type.
Experimental API. Subject to change without deprecation.
CUDA.CuTexture — Method
CuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)
Construct a N-dimensional texture object with elements of type T as stored in parent.
Several keyword arguments alter the behavior of texture objects:
address_mode (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of N entries.
interpolation (nearest neighbour, linear, bilinear): how non-integral indices are fetched. Nearest-neighbour fetches a single value, others interpolate between multiple.
normalized_coordinates (true, false): whether indices are expected to fall in the normalized [0:1) range.
!!! warning Experimental API. Subject to change without deprecation.
CuTexture(x::CuTextureArray{T,N})
Create a N-dimensional texture object with elements of type T that will be read from x.
Experimental API. Subject to change without deprecation.
CuTexture(x::CuArray{T,N})
Create a N-dimensional texture object that reads from a CuArray.
Note that the memory needs to be well-aligned and strided (a good pitch). Currently, this is not enforced.
Experimental API. Subject to change without deprecation.
You can create CuTextureArray objects from both host and device memory:
CUDA.CuTextureArray — Type
CuTextureArray{T,N}(undef, dims)
N-dimensional dense texture array with elements of type T. These arrays are optimized for texture fetching, and are only meant to be used as a source for CuTexture{T,N,P} objects.
Experimental API. Subject to change without deprecation.
CUDA.CuTextureArray — Method
CuTextureArray(A::AbstractArray)
Allocate and initialize a texture array from host memory in A.
Experimental API. Subject to change without deprecation.
CuTextureArray(A::CuArray)
Allocate and initialize a texture array from device memory in A.
Experimental API. Subject to change without deprecation.
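A sketch tying these together; as this API is experimental, details may change:
A = rand(Float32, 256, 256)
ta = CuTextureArray(A)    # copy host data into an optimized texture array
tex = CuTexture(ta)       # texture object with default sampling settings
# pass tex to a kernel, where it is accessible as a CuDeviceTexture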
Occupancy API
The occupancy API can be used to figure out an appropriate launch configuration for a compiled kernel (represented as a CuFunction) on the current device:
CUDA.launch_configuration — Function
launch_configuration(fun::CuFunction; shmem=0, max_threads=0)
Calculate a suggested launch configuration for kernel fun requiring shmem bytes of dynamic shared memory. Returns a tuple with a suggested amount of threads, and the minimal amount of blocks to reach maximal occupancy. Optionally, the maximum amount of threads can be constrained using max_threads.
In the case of a variable amount of shared memory, pass a callable object for shmem instead, taking a single integer representing the block size and returning the amount of dynamic shared memory for that configuration.
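A sketch of the usual pattern for sizing a launch over N elements, assuming the returned tuple has threads and blocks fields:
config = launch_configuration(fun)   # fun::CuFunction compiled earlier
threads = min(N, config.threads)
blocks = cld(N, threads)
# then launch with, e.g., cudacall(fun, types, args...; threads, blocks)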
CUDA.active_blocks — Function
active_blocks(fun::CuFunction, threads; shmem=0)
Calculate the maximum number of active blocks per multiprocessor when running threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.
CUDA.occupancy — Function
occupancy(fun::CuFunction, threads; shmem=0)
Calculate the theoretical occupancy of launching threads threads of a kernel fun requiring shmem bytes of dynamic shared memory.
Graph Execution
CUDA graphs can be easily recorded and executed using the high-level @captured macro:
CUDA.@captured — Macro
for ...
    @captured begin
        # code that executes several kernels or CUDA operations
    end
end
A convenience macro for recording a graph of CUDA operations, and automatically caching and updating the resulting executable graph. This can improve performance when executing kernels in a loop, where the launch overhead might dominate the execution.
For this to be effective, the kernels and operations executed inside of the captured region should not significantly change across iterations of the loop. It is allowed to, e.g., change kernel arguments or inputs to operations, as this will be processed by updating the cached executable graph. However, significant changes will result in an instantiation of the graph from scratch, which is an expensive operation.
See also: capture.
Low-level operations are available too:
CUDA.CuGraph — Type
CuGraph([flags])
Create an empty graph for use with low-level graph operations. If you want to create a graph while directly recording operations, use capture. For a high-level interface that also automatically executes the graph, use the @captured macro.
CUDA.capture — Function
capture([flags], [throw_error::Bool=true]) do
...
end
Capture a graph of CUDA operations. The returned graph can then be instantiated and executed repeatedly for improved performance.
Note that many operations, like initial kernel compilation or memory allocations, cannot be captured. To work around this, you can set the throw_error keyword to false, which will cause this function to return nothing if such a failure happens. You can then try to evaluate the function in a regular way, and re-record afterwards.
See also: instantiate.
CUDA.instantiate — Function
instantiate(graph::CuGraph)
Creates an executable graph from a graph. This executable graph can then be launched, or updated with another graph.
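A sketch of the low-level workflow (launch and update are documented below):
graph = capture() do
    # CUDA operations to record
end
exec = instantiate(graph)
launch(exec)   # run the executable graph, here on the current stream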
CUDA.launch — Method
launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
       cooperative=false, shmem=0, stream=stream())
Low-level call to launch a CUDA function f on the GPU, using blocks and threads as the grid and block configuration respectively. Dynamic shared memory is allocated according to shmem, and the kernel is launched on stream stream.
Arguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.
This is a low-level call, prefer to use cudacall instead.
launch(exec::CuGraphExec, [stream::CuStream])
Launches an executable graph, by default in the currently-active stream.
CUDA.update — Function
update(exec::CuGraphExec, graph::CuGraph; [throw_error::Bool=true])
Check whether an executable graph can be updated with a graph and perform the update if possible. Returns a boolean indicating whether the update was successful. Unless throw_error is set to false, also throws an error if the update failed.