CUDA driver
This section lists the package's public functionality that directly corresponds to functionality of the CUDA driver API. In general, the abstractions stay close to those of the CUDA driver API, so for more information on certain library calls you can consult the CUDA driver API reference.
The documentation is grouped according to the modules of the driver API.
Error Handling
CUDA.CuError — Type
CuError(code)
CuError(code, meta)
Create a CUDA error object with error code `code`. The optional `meta` parameter indicates whether extra information, such as error logs, is known.
CUDA.name — Method
name(err::CuError)
Gets the string representation of an error code.
julia> err = CuError(CUDA.cudaError_enum(1))
CuError(CUDA_ERROR_INVALID_VALUE)
julia> name(err)
"ERROR_INVALID_VALUE"
CUDA.description — Method
description(err::CuError)
Gets the string description of an error code.
Version Management
CUDA.version — Method
version()
Returns the CUDA version as reported by the driver.
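For example, querying the driver version in the REPL (the reported version will depend on your system):
julia> CUDA.version()
v"11.0.0"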
Device Management
CUDA.CuDevice — Type
CuDevice(i::Integer)
Get a handle to a compute device.
CUDA.devices — Function
devices()
Get an iterator for the compute devices.
CUDA.name — Method
name(dev::CuDevice)
Returns an identifier string for the device.
CUDA.totalmem — Method
totalmem(dev::CuDevice)
Returns the total amount of memory (in bytes) on the device.
CUDA.attribute — Function
attribute(dev::CuDevice, code)
Returns information about the device.
attribute(X, ptr::Union{Ptr,CuPtr}, attr)
Returns attribute `attr` about pointer `ptr`. The type of the returned value depends on the attribute, and as such must be passed as the `X` parameter.
Certain common attributes are exposed by additional convenience functions:
CUDA.capability — Method
capability(dev::CuDevice)
Returns the compute capability of the device.
CUDA.warpsize — Method
warpsize(dev::CuDevice)
Returns the warp size (in threads) of the device.
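As a combined sketch of these queries, the following loop prints a few properties of every device; the DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK attribute code is assumed to be exposed under the CUDA module, mirroring the driver's CUdevice_attribute enum:
for dev in devices()
    @show name(dev) totalmem(dev) capability(dev) warpsize(dev)
    @show attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
end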
Context Management
CUDA.CuContext — Type
CuContext(dev::CuDevice, flags=CTX_SCHED_AUTO)
CuContext(f::Function, ...)
Create a CUDA context for the given device. A context on the GPU is analogous to a process on the CPU, with its own distinct address space and allocated resources. When a context is destroyed, the system cleans up the resources allocated to it.
When you are done using the context, call `CUDA.unsafe_destroy!` to mark it for deletion, or use do-block syntax with this constructor.
CUDA.unsafe_destroy! — Method
unsafe_destroy!(ctx::CuContext)
Immediately destroy a context, freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.
CUDA.CuCurrentContext — Function
CuCurrentContext()
Return the current context, or `nothing` if there is no active context.
CUDA.activate — Method
activate(ctx::CuContext)
Binds the specified CUDA context to the calling CPU thread.
CUDA.synchronize — Method
synchronize()
Block for the current context's tasks to complete.
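A minimal sketch of the do-block form, which creates a context, makes it current, and marks it for deletion once the block returns:
dev = CuDevice(0)
CuContext(dev) do ctx
    # the context is now active on this thread
    synchronize()    # wait for all of its tasks to complete
end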
Primary Context Management
CUDA.CuPrimaryContext — Type
CuPrimaryContext(dev::CuDevice)
Create a primary CUDA context for a given device.
Each primary context is unique per device and is shared with the CUDA runtime API. It is meant for interoperability with (applications using) the runtime API.
CUDA.CuContext — Method
CuContext(pctx::CuPrimaryContext)
Retain the primary context on the GPU, returning a context compatible with the driver API. The primary context will be released when the returned driver context is finalized.
As these contexts are refcounted by CUDA, you should not call `CUDA.unsafe_destroy!` on them, but use `CUDA.unsafe_release!` instead (available with do-block syntax as well).
CUDA.isactive — Method
isactive(pctx::CuPrimaryContext)
Query whether a primary context is active.
CUDA.flags — Method
flags(pctx::CuPrimaryContext)
Query the flags of a primary context.
CUDA.setflags! — Method
setflags!(pctx::CuPrimaryContext, flags)
Set the flags of a primary context.
CUDA.unsafe_reset! — Method
unsafe_reset!(pctx::CuPrimaryContext)
Explicitly destroys and cleans up all resources associated with a device's primary context in the current process. Note that this forcibly invalidates all contexts derived from this primary context, and as a result outstanding resources might become invalid.
CUDA.unsafe_release! — Method
CUDA.unsafe_release!(ctx::CuContext)
Lower the refcount of a context, possibly freeing up all resources associated with it. This does not respect any users of the context, and might make other objects unusable.
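A minimal sketch of the retain/release pattern described above:
dev = CuDevice(0)
pctx = CuPrimaryContext(dev)
ctx = CuContext(pctx)       # retain the primary context
@assert isactive(pctx)
# ... use ctx like any other driver context ...
CUDA.unsafe_release!(ctx)   # lower the refcount instead of destroying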
Module Management
CUDA.CuModule — Type
CuModule(data, options::Dict{CUjit_option,Any})
CuModuleFile(path, options::Dict{CUjit_option,Any})
Create a CUDA module from data, or a file containing data. The data may be PTX code, a CUBIN, or a FATBIN.
The optional `options` argument is a dictionary of JIT options and their respective values.
Function Management
CUDA.CuFunction — Type
CuFunction(mod::CuModule, name::String)
Acquires a function handle from a named function in a module.
Global Variable Management
CUDA.CuGlobal — Type
CuGlobal{T}(mod::CuModule, name::String)
Acquires a typed global variable handle from a named global in a module.
Base.eltype — Method
eltype(var::CuGlobal)
Return the element type of a global variable object.
Base.getindex — Method
Base.getindex(var::CuGlobal)
Return the current value of a global variable.
Base.setindex! — Method
Base.setindex!(var::CuGlobal{T}, val::T)
Set the value of a global variable to `val`.
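Tying modules, functions and globals together, a sketch that loads a module and looks up its contents; the vadd.ptx file, its vadd kernel and its flag global are hypothetical stand-ins for your own module:
md = CuModuleFile("vadd.ptx")
vadd = CuFunction(md, "vadd")
flag = CuGlobal{Int32}(md, "flag")
flag[] = Int32(1)             # upload a value to the global
@assert flag[] == Int32(1)    # and read it back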
Linker
CUDA.CuLink — Type
CuLink()
Creates a pending JIT linker invocation.
CUDA.add_data! — Function
add_data!(link::CuLink, name::String, code::String)
Add PTX code to a pending link operation.
add_data!(link::CuLink, name::String, data::Vector{UInt8}, type::CUjitInputType)
Add object code to a pending link operation.
CUDA.add_file! — Function
add_file!(link::CuLink, path::String, typ::CUjitInputType)
Add data from a file to a link operation. The argument `typ` indicates the type of the contained data.
CUDA.CuLinkImage — Type
The result of a linking operation.
This object keeps its parent linker object alive, as destroying a linker destroys linked images too.
CUDA.complete — Function
complete(link::CuLink)
Complete a pending linker invocation, returning an output image.
CUDA.CuModule — Method
CuModule(img::CuLinkImage, ...)
Create a CUDA module from a completed linking operation. Options from `CuModule` apply.
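A minimal linking sketch, assuming ptx_code holds a valid PTX string:
link = CuLink()
add_data!(link, "my_kernel", ptx_code)   # add PTX source to the pending link
img = complete(link)                     # finish linking, producing an image
md = CuModule(img)                       # load the linked image as a module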
Memory Management
Three kinds of memory buffers can be allocated: device memory, host memory, and unified memory. Each of these buffers can be allocated by calling `alloc` with the type of buffer as the first argument, and freed by calling `free`. Certain buffers have specific methods defined.
CUDA.Mem.DeviceBuffer — Type
Mem.DeviceBuffer
Mem.Device
A buffer of device memory residing on the GPU.
CUDA.Mem.alloc — Method
Mem.alloc(DeviceBuffer, bytesize::Integer)
Allocate `bytesize` bytes of memory on the device. This memory is only accessible on the GPU, and requires explicit calls to `unsafe_copyto!`, which wraps `cuMemcpy`, for access on the CPU.
CUDA.Mem.HostBuffer — Type
Mem.HostBuffer
Mem.Host
A buffer of pinned memory on the CPU, possibly accessible on the GPU.
CUDA.Mem.alloc — Method
Mem.alloc(HostBuffer, bytesize::Integer, [flags])
Allocate `bytesize` bytes of page-locked memory on the host. This memory is accessible from the CPU, and makes it possible to perform faster memory copies to the GPU. Furthermore, if `flags` is set to `HOSTALLOC_DEVICEMAP` the memory is also accessible from the GPU. These accesses are direct, and go through the PCI bus. If `flags` is set to `HOSTALLOC_PORTABLE`, the memory is considered mapped by all CUDA contexts, not just the one that created the memory, which is useful if the memory needs to be accessed from multiple devices. Multiple flags can be set at one time using a bitwise OR:
flags = HOSTALLOC_PORTABLE | HOSTALLOC_DEVICEMAP
CUDA.Mem.register — Method
Mem.register(HostBuffer, ptr::Ptr, bytesize::Integer, [flags])
Page-lock the host memory pointed to by `ptr`. Subsequent transfers to and from devices will be faster, and can be executed asynchronously. If the `HOSTREGISTER_DEVICEMAP` flag is specified, the buffer will also be accessible directly from the GPU. These accesses are direct, and go through the PCI bus. If the `HOSTREGISTER_PORTABLE` flag is specified, any CUDA context can access the memory.
CUDA.Mem.unregister — Method
Mem.unregister(HostBuffer)
Unregisters a memory range that was registered with `Mem.register`.
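For instance, a sketch of pinning an existing host array; Mem.register is assumed to return a HostBuffer, which must be unregistered before the array goes out of scope:
A = zeros(Float32, 1024)
buf = Mem.register(Mem.Host, pointer(A), sizeof(A))
# ... perform faster, possibly asynchronous copies to and from A ...
Mem.unregister(buf)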
CUDA.Mem.UnifiedBuffer — Type
Mem.UnifiedBuffer
Mem.Unified
A managed buffer that is accessible on both the CPU and GPU.
CUDA.Mem.alloc — Method
Mem.alloc(UnifiedBuffer, bytesize::Integer, [flags::CUmemAttach_flags])
Allocate `bytesize` bytes of unified memory. This memory is accessible from both the CPU and GPU, with the CUDA driver automatically copying upon first access.
CUDA.Mem.prefetch — Method
prefetch(::UnifiedBuffer, [bytes::Integer]; [device::CuDevice], [stream::CuStream])
Prefetches memory to the specified destination device.
CUDA.Mem.advise — Method
advise(::UnifiedBuffer, advice::CUDA.CUmem_advise, [bytes::Integer]; [device::CuDevice])
Advise about the usage of a given memory range.
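As a sketch of how unified memory fits together (using the pointer conversion described below): allocate a buffer, fill it from the CPU, and prefetch it to the GPU. The zero-argument form of Mem.prefetch is assumed to default to the whole buffer:
buf = Mem.alloc(Mem.Unified, 10*sizeof(Float32))
ptr = convert(Ptr{Float32}, buf)
arr = unsafe_wrap(Array, ptr, 10)   # host view of the unified memory
arr .= 1f0
Mem.prefetch(buf)                   # migrate the buffer to the current device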
To work with these buffers, you need to `convert` them to a `Ptr` or `CuPtr`. Several methods then work with these raw pointers.
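For instance, a sketch of allocating device and host buffers and converting them to their raw pointer types (these convert methods are assumed to be defined for the respective buffer types):
gpu = Mem.alloc(Mem.Device, 1024)
gpu_ptr = convert(CuPtr{UInt8}, gpu)   # device pointer, usable in kernels
cpu = Mem.alloc(Mem.Host, 1024)
cpu_ptr = convert(Ptr{UInt8}, cpu)     # host pointer to the pinned memory
Mem.free(gpu)
Mem.free(cpu)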
Memory info
CUDA.available_memory — Function
available_memory()
Returns the amount of memory (in bytes) that is available for allocation by the CUDA context.
CUDA.total_memory — Function
total_memory()
Returns the total amount of memory (in bytes) available for allocation by the CUDA context.
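For example, to report the fraction of memory that is still free:
frac = available_memory() / total_memory()
println("free memory: ", round(100 * frac; digits=1), "%")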
Stream Management
CUDA.CuStream — Type
CuStream(; flags=STREAM_DEFAULT, priority=nothing)
Create a CUDA stream.
CUDA.CuDefaultStream — Function
CuDefaultStream()
Return the default stream.
CUDA.synchronize — Method
synchronize(s::CuStream)
Wait until a stream's tasks are completed.
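A small sketch that creates a dedicated stream, queues work on it, and waits for completion; the STREAM_NON_BLOCKING flag is assumed to be exposed alongside STREAM_DEFAULT:
s = CuStream(; flags=CUDA.STREAM_NON_BLOCKING)
# queue work on the stream, e.g. cudacall(...; stream=s)
synchronize(s)   # block until all tasks on s have completed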
Event Management
CUDA.CuEvent — Type
CuEvent()
Create a new CUDA event.
CUDA.record — Function
record(e::CuEvent, stream=CuDefaultStream())
Record an event on a stream.
CUDA.synchronize — Method
synchronize(e::CuEvent)
Waits for an event to complete.
CUDA.elapsed — Function
elapsed(start::CuEvent, stop::CuEvent)
Computes the elapsed time between two events (in seconds).
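A sketch of the usual timing pattern built from these calls:
start, stop = CuEvent(), CuEvent()
record(start)
# ... launch the GPU work to be timed ...
record(stop)
synchronize(stop)                # ensure the stop event has been reached
println(elapsed(start, stop))    # elapsed time in seconds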
CUDA.@elapsed — Macro
@elapsed stream ex
@elapsed ex
A macro to evaluate an expression, discarding the resulting value and instead returning the number of seconds it took to execute on the GPU, as a floating-point number.
Execution Control
CUDA.CuDim3 — Type
CuDim3(x)
CuDim3((x,))
CuDim3((x, y))
CuDim3((x, y, z))
A type used to specify dimensions, consisting of 3 integers for respectively the `x`, `y` and `z` dimension. Unspecified dimensions default to `1`.
Often accepted as an argument through the `CuDim` type alias, e.g. in the case of `cudacall` or `CUDA.launch`, allowing dimensions to be passed as a plain integer or as a tuple without having to construct an explicit `CuDim3` object.
CUDA.cudacall — Function
cudacall(f::CuFunction, types, values...; blocks::CuDim, threads::CuDim,
         cooperative=false, shmem=0, stream=CuDefaultStream())
`ccall`-like interface for launching a CUDA function `f` on a GPU.
For example:
vadd = CuFunction(md, "vadd")
a = rand(Float32, 10)
b = rand(Float32, 10)
ad = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
unsafe_copyto!(ad, pointer(a), 10*sizeof(Float32))
bd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
unsafe_copyto!(bd, pointer(b), 10*sizeof(Float32))
c = zeros(Float32, 10)
cd = Mem.alloc(DeviceBuffer, 10*sizeof(Float32))
cudacall(vadd, (CuPtr{Cfloat},CuPtr{Cfloat},CuPtr{Cfloat}), ad, bd, cd; threads=10)
unsafe_copyto!(pointer(c), cd, 10*sizeof(Float32))
The `blocks` and `threads` arguments control the launch configuration, and should both consist of either an integer, or a tuple of 1 to 3 integers (omitted dimensions default to 1). The `types` argument can contain both a tuple of types, and a tuple type, the latter being slightly faster.
CUDA.launch — Function
launch(f::CuFunction, args...; blocks::CuDim=1, threads::CuDim=1,
       cooperative=false, shmem=0, stream=CuDefaultStream())
Low-level call to launch a CUDA function `f` on the GPU, using `blocks` and `threads` as respectively the grid and block configuration. Dynamic shared memory is allocated according to `shmem`, and the kernel is launched on stream `stream`.
Arguments to a kernel should either be bitstype, in which case they will be copied to the internal kernel parameter buffer, or a pointer to device memory.
This is a low-level call; prefer to use `cudacall` instead.
Profiler Control
CUDA.@profile — Macro
@profile ex
Run expressions while activating the CUDA profiler.
Note that this API is used to programmatically control the profiling granularity by allowing profiling to be done only on selective pieces of code. It does not perform any profiling itself; you need external tools for that.
CUDA.Profile.start — Function
start()
Enables profile collection by the active profiling tool for the current context. If profiling is already enabled, then this call has no effect.
CUDA.Profile.stop — Function
stop()
Disables profile collection by the active profiling tool for the current context. If profiling is already disabled, then this call has no effect.
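A minimal sketch; run the script under an external profiler (e.g. nvprof or Nsight Systems), and only the wrapped region will be collected:
CUDA.@profile begin
    # ... GPU work of interest ...
    synchronize()
end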
Texture Memory
Textures are represented by objects of type `CuTexture`, which are bound to some underlying memory, either `CuArray`s or `CuTextureArray`s:
CUDA.CuTexture — Type
CuTexture{T,N,P}
`N`-dimensional texture object with elements of type `T`. These objects do not store data themselves, but are bound to another source of device memory. Texture objects can be passed to CUDA kernels, where they will be accessible through the `CuDeviceTexture` type.
Experimental API. Subject to change without deprecation.
CUDA.CuTexture — Method
CuTexture{T,N,P}(parent::P; address_mode, filter_mode, normalized_coordinates)
Construct an `N`-dimensional texture object with elements of type `T` as stored in `parent`.
Several keyword arguments alter the behavior of texture objects:
- `address_mode` (wrap, clamp, mirror): how out-of-bounds values are accessed. Can be specified as a value for all dimensions, or as a tuple of `N` entries.
- `interpolation` (nearest neighbour, linear, bilinear): how non-integral indices are fetched. Nearest-neighbour fetches a single value, others interpolate between multiple.
- `normalized_coordinates` (true, false): whether indices are expected to fall in the normalized `[0:1)` range.
Experimental API. Subject to change without deprecation.
CuTexture(x::CuTextureArray{T,N})
Create an `N`-dimensional texture object with elements of type `T` that will be read from `x`.
Experimental API. Subject to change without deprecation.
CuTexture(x::CuArray{T,N})
Create an `N`-dimensional texture object that reads from a `CuArray`.
Note that the memory needs to be well-aligned and properly strided (a good pitch). Currently, this is not enforced.
Experimental API. Subject to change without deprecation.
You can create `CuTextureArray` objects from both host and device memory:
CUDA.CuTextureArray — Type
CuTextureArray{T,N}(undef, dims)
`N`-dimensional dense texture array with elements of type `T`. These arrays are optimized for texture fetching, and are only meant to be used as a source for `CuTexture{T,N,P}` objects.
Experimental API. Subject to change without deprecation.
CUDA.CuTextureArray — Method
CuTextureArray{T,N}(undef, dims)
Construct an uninitialized texture array of `N` dimensions specified in the `dims` tuple, with elements of type `T`. Use `Base.copyto!` to initialize this texture array, or use constructors that take a non-texture array to do so automatically.
Experimental API. Subject to change without deprecation.
CuTextureArray(A::AbstractArray)
Allocate and initialize a texture buffer from host memory in `A`.
Experimental API. Subject to change without deprecation.
CuTextureArray(A::CuArray)
Allocate and initialize a texture buffer from device memory in `A`.
Experimental API. Subject to change without deprecation.
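Putting the texture APIs together, a sketch that uploads host data into a texture array and wraps it in a texture object that can be passed to a kernel:
A = rand(Float32, 256, 256)
texarr = CuTextureArray(A)   # allocate a texture array and copy the host data
tex = CuTexture(texarr)      # texture object, accessible in kernels as a CuDeviceTexture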