Kernel programming

This section lists the package's public functionality that corresponds to special CUDA functions for use in device code. It is loosely organized according to the C/C++ language extensions appendix from the CUDA programming guide. For more information about certain intrinsics, refer to the aforementioned NVIDIA documentation.

Indexing and dimensions

Differences with corresponding C/C++ indexing variables

The indexing functions blockIdx and threadIdx have different starting index from the corresponding variables in the C/C++ extensions. Be careful when literally porting code from C/C++.

CUDACore.gridDim — Function

gridDim()::NamedTuple

Returns the dimensions (in blocks) of the grid as a NamedTuple with keys x, y, and z. Unlike the *Idx intrinsics, gridDim returns the same value as its C/C++ extension counterpart.

CUDACore.clusterIdx — Function

clusterIdx()::NamedTuple

Returns the cluster index within the grid as a NamedTuple with keys x, y, and z. These indices are 1-based.

CUDACore.clusterDim — Function

clusterDim()::NamedTuple

Returns the dimensions (in blocks) of the cluster as a NamedTuple with keys x, y, and z.

CUDACore.blockIdxInCluster — Function

blockIdxInCluster()::NamedTuple

Returns the block index within the cluster as a NamedTuple with keys x, y, and z. These indices are 1-based.

CUDACore.linearBlockIdxInCluster — Function

linearBlockIdxInCluster()::Int32

Returns the linear block index within the cluster. These indices are 1-based.

CUDACore.blockIdx — Function

blockIdx()::NamedTuple

Returns the block index within the grid as a NamedTuple with keys x, y, and z. These indices are 1-based, unlike the blockIdx built-in variable in the C/C++ extension which is 0-based.

CUDACore.blockDim — Function

blockDim()::NamedTuple

Returns the dimensions (in threads) of the block as a NamedTuple with keys x, y, and z. Unlike the *Idx intrinsics, blockDim returns the same value as its C/C++ extension counterpart.

CUDACore.threadIdx — Function

threadIdx()::NamedTuple

Returns the thread index within the block as a NamedTuple with keys x, y, and z. These indices are 1-based, unlike the threadIdx built-in variable in the C/C++ extension which is 0-based.

CUDACore.warpsize — Function

warpsize()::Int32

Returns the warp size (in threads). This corresponds to the warpSize built-in variable in the C/C++ extension.

CUDACore.laneid — Function

laneid()::Int32

Returns the thread's lane within the warp. This ID is 1-based.

CUDACore.lanemask — Function

lanemask(pred)::UInt32

Returns a 32-bit mask indicating which threads in a warp satisfy the given predicate. Supported predicates are ==, <, <=, >=, and >.

CUDACore.active_mask — Function

active_mask()

Returns a 32-bit mask indicating which threads in a warp are active with the current executing thread.

CUDACore.FULL_MASK — Constant

FULL_MASK

A 32-bit mask indicating that all threads in a warp are active.

Device arrays

CUDA.jl provides a primitive, lightweight array type to manage GPU data organized in an plain, dense fashion. This is the device-counterpart to the CuArray, and implements (part of) the array interface as well as other functionality for use on the GPU:

CUDACore.CuDeviceArray — Type

CuDeviceArray{T,N,A}(ptr, dims, [maxsize])

Construct an N-dimensional dense CUDA device array with element type T wrapping a pointer, where N is determined from the length of dims and T is determined from the type of ptr. dims may be a single scalar, or a tuple of integers corresponding to the lengths in each dimension). If the rank N is supplied explicitly as in Array{T,N}(dims), then it must match the length of dims. The same applies to the element type T, which should match the type of the pointer ptr.

CUDACore.Const — Type

Const(A::CuDeviceArray)

Mark a CuDeviceArray as constant/read-only. The invariant guaranteed is that you will not modify an CuDeviceArray for the duration of the current kernel.

This API can only be used on devices with compute capability 3.5 or higher.

Warning

Experimental API. Subject to change without deprecation.

Memory types

Shared memory

CUDACore.CuStaticSharedArray — Function

CuStaticSharedArray(T::Type, dims) -> CuDeviceArray{T,N,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a statically-allocated piece of shared memory. The type should be statically inferable and the dimensions should be constant, or an error will be thrown and the generator function will be called dynamically.

CUDACore.CuDynamicSharedArray — Function

CuDynamicSharedArray(T::Type, dims, offset::Integer=0) -> CuDeviceArray{T,N,AS.Shared}

Get an array of type T and dimensions dims (either an integer length or tuple shape) pointing to a dynamically-allocated piece of shared memory. The type should be statically inferable or an error will be thrown and the generator function will be called dynamically.

Note that the amount of dynamic shared memory needs to specified when launching the kernel.

Optionally, an offset parameter indicating how many bytes to add to the base shared memory pointer can be specified. This is useful when dealing with a heterogeneous buffer of dynamic shared memory; in the case of a homogeneous multi-part buffer it is preferred to use view.

Texture memory

CUDACore.CuDeviceTexture — Type

CuDeviceTexture{T,N,M,NC,I}

N-dimensional device texture with elements of type T. This type is the device-side counterpart of CuTexture{T,N,P}, and can be used to access textures using regular indexing notation. If NC is true, indices used by these accesses should be normalized, i.e., fall into the [0,1) domain. The I type parameter indicates the kind of interpolation that happens when indexing into this texture. The source memory of the texture is specified by the M parameter, either linear memory or a texture array.

Device-side texture objects cannot be created directly, but should be created host-side using CuTexture{T,N,P} and passed to the kernel as an argument.

Warning

Experimental API. Subject to change without deprecation.

Synchronization

CUDACore.sync_threads — Function

sync_threads()

Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to sync_threads() are visible to all threads in the block.

CUDACore.sync_threads_count — Function

sync_threads_count(predicate)

Identical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to true.

CUDACore.sync_threads_and — Function

sync_threads_and(predicate)

Identical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for all of them.

CUDACore.sync_threads_or — Function

sync_threads_or(predicate)

Identical to sync_threads() with the additional feature that it evaluates predicate for all threads of the block and returns true if and only if predicate evaluates to true for any of them.

CUDACore.sync_warp — Function

sync_warp(mask::Integer=FULL_MASK)

Waits threads in the warp, selected by means of the bitmask mask, have reached this point and all global and shared memory accesses made by these threads prior to sync_warp() are visible to those threads in the warp. The default value for mask selects all threads in the warp.

Note

Requires CUDA >= 9.0 and sm_6.2

CUDACore.threadfence_block — Function

threadfence_block()

A memory fence that ensures that:

All writes to all memory made by the calling thread before the call to threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to threadfence_block()
All reads from all memory made by the calling thread before the call to threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to threadfence_block().

CUDACore.threadfence — Function

threadfence()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that no writes to all memory made by the calling thread after the call to threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to threadfence().

Note that for this ordering guarantee to be true, the observing threads must truly observe the memory and not cached versions of it; this is requires the use of volatile loads and stores, which is not available from Julia right now.

CUDACore.threadfence_system — Function

threadfence_system()

A memory fence that acts as threadfence_block for all threads in the block of the calling thread and also ensures that all writes to all memory made by the calling thread before the call to threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to threadfence_system().

Time functions

CUDACore.clock — Function

clock(UInt32)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

clock(UInt64)

Returns the value of a per-multiprocessor counter that is incremented every clock cycle.

CUDACore.nanosleep — Function

nanosleep(t)

Puts a thread for a given amount t(in nanoseconds).

Note

Requires CUDA >= 10.0 and sm_6.2

Warp-level functions

Voting

The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation. These functions take as input a boolean predicate from each thread in the warp and evaluate it. The results of that evaluation are combined (reduced) across the active threads of the warp in one different ways, broadcasting a single return value to each participating thread.

CUDACore.vote_all_sync — Function

vote_all_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return whether predicate is true for all of them.

CUDACore.vote_any_sync — Function

vote_any_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return whether predicate is true for any of them.

CUDACore.vote_uni_sync — Function

vote_uni_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return whether predicate is the same for any of them.

CUDACore.vote_ballot_sync — Function

vote_ballot_sync(mask::UInt32, predicate::Bool)

Evaluate predicate for all active threads of the warp and return an integer whose Nth bit is set if and only if predicate is true for the Nth thread of the warp and the Nth thread is active.

Shuffle

CUDACore.shfl_sync — Function

shfl_sync(threadmask::UInt32, val, lane::Integer, width::Integer=32)

Shuffle a value from a directly indexed lane lane, and synchronize threads according to threadmask.

CUDACore.shfl_up_sync — Function

shfl_up_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with lower ID relative to caller, and synchronize threads according to threadmask.

CUDACore.shfl_down_sync — Function

shfl_down_sync(threadmask::UInt32, val, delta::Integer, width::Integer=32)

Shuffle a value from a lane with higher ID relative to caller, and synchronize threads according to threadmask.

CUDACore.shfl_xor_sync — Function

shfl_xor_sync(threadmask::UInt32, val, mask::Integer, width::Integer=32)

Shuffle a value from a lane based on bitwise XOR of own lane ID with mask, and synchronize threads according to threadmask.

Formatted Output

CUDACore.@cushow — Macro

@cushow(ex)

GPU analog of Base.@show. It comes with the same type restrictions as @cuprintf.

@cushow threadIdx().x

CUDACore.@cuprint — Macro

@cuprint(xs...)
@cuprintln(xs...)

Print a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more use friendly alternative of that API. However, that also means there's only limited support for argument types, handling 16/32/64 signed and unsigned integers, 32 and 64-bit floating point numbers, Cchars and pointers. For more complex output, use @cuprintf directly.

Limited string interpolation is also possible:

    @cuprint("Hello, World ", 42, "\n")
    @cuprint "Hello, World $(42)\n"

CUDACore.@cuprintln — Macro

@cuprint(xs...)
@cuprintln(xs...)

Print a textual representation of values xs to standard output from the GPU. The functionality builds on @cuprintf, and is intended as a more use friendly alternative of that API. However, that also means there's only limited support for argument types, handling 16/32/64 signed and unsigned integers, 32 and 64-bit floating point numbers, Cchars and pointers. For more complex output, use @cuprintf directly.

Limited string interpolation is also possible:

    @cuprint("Hello, World ", 42, "\n")
    @cuprint "Hello, World $(42)\n"

CUDACore.@cuprintf — Macro

@cuprintf("%Fmt", args...)

Print a formatted string in device context on the host standard output.

Note that this is not a fully C-compliant printf implementation; see the CUDA documentation for supported options and inputs.

Also beware that it is an untyped, and unforgiving printf implementation. Type widths need to match, eg. printing a 64-bit Julia integer requires the %ld formatting string.

Assertions

CUDACore.@cuassert — Macro

@assert cond [text]

Signal assertion failure to the CUDA driver if cond is false. Preferred syntax for writing assertions, mimicking Base.@assert. Message text is optionally displayed upon assertion failure.

Warning

A failed assertion will crash the GPU, so use sparingly as a debugging tool. Furthermore, the assertion might be disabled at various optimization levels, and thus should not cause any side-effects.

Atomics

A high-level macro is available to annotate expressions with:

CUDACore.@atomic — Macro

@atomic a[I] = op(a[I], val)
@atomic a[I] ...= val

Atomically perform a sequence of operations that loads an array element a[I], performs the operation op on that value and a second value val, and writes the result back to the array. This sequence can be written out as a regular assignment, in which case the same array element should be used in the left and right hand side of the assignment, or as an in-place application of a known operator. In both cases, the array reference should be pure and not induce any side-effects.

Warn

This interface is experimental, and might change without warning. Use the lower-level atomic_...! functions for a stable API, albeit one limited to natively-supported ops.

If your expression is not recognized, or you need more control, use the underlying functions:

CUDACore.atomic_cas! — Function

atomic_cas!(ptr::LLVMPtr{T}, cmp::T, val::T)

Reads the value old located at address ptr and compare with cmp. If old equals to cmp, stores val at the same address. Otherwise, doesn't change the value old. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64. Additionally, on GPU hardware with compute capability 7.0+, values of type UInt16 are supported.

CUDACore.atomic_xchg! — Function

atomic_xchg!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr and stores val at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDACore.atomic_add! — Function

atomic_add!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old + val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32, UInt64, and Float32. Additionally, on GPU hardware with compute capability 6.0+, values of type Float64 are supported.

CUDACore.atomic_sub! — Function

atomic_sub!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old - val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDACore.atomic_and! — Function

atomic_and!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old & val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDACore.atomic_or! — Function

atomic_or!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old | val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDACore.atomic_xor! — Function

atomic_xor!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes old ⊻ val, and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDACore.atomic_min! — Function

atomic_min!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes min(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDACore.atomic_max! — Function

atomic_max!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes max(old, val), and stores the result back to memory at the same address. These operations are performed in one atomic transaction. The function returns old.

This operation is supported for values of type Int32, Int64, UInt32 and UInt64.

CUDACore.atomic_inc! — Function

atomic_inc!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.

CUDACore.atomic_dec! — Function

atomic_dec!(ptr::LLVMPtr{T}, val::T)

Reads the value old located at address ptr, computes (((old == 0) | (old > val)) ? val : (old-1) ), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

This operation is only supported for values of type Int32.

Dynamic parallelism

Similarly to launching kernels from the host, you can use @cuda while passing dynamic=true for launching kernels from the device. A lower-level API is available as well:

CUDACore.dynamic_cufunction — Function

dynamic_cufunction(f, tt=Tuple{})

Low-level interface to compile a function invocation for the currently-active GPU, returning a callable kernel object. Device-side equivalent of CUDACore.cufunction.

No keyword arguments are supported.

CUDACore.DeviceKernel — Type

(::HostKernel)(args...; kwargs...)
(::DeviceKernel)(args...; kwargs...)

Low-level interface to call a compiled kernel, passing GPU-compatible arguments in args. For a higher-level interface, use @cuda.

A HostKernel is callable on the host, and a DeviceKernel is callable on the device (created by @cuda with dynamic=true).

The following keyword arguments are supported:

threads (default: 1): Number of threads per block, or a 1-, 2- or 3-tuple of dimensions (e.g. threads=(32, 32) for a 2D block of 32×32 threads). Use threadIdx() and blockDim() to query from within the kernel.
blocks (default: 1): Number of thread blocks to launch, or a 1-, 2- or 3-tuple of dimensions (e.g. blocks=(2, 4, 2) for a 3D grid of blocks). Use blockIdx() and gridDim() to query from within the kernel.
clustersize (default: 1): Number of thread blocks to launch as a cooperative cluster, or a 1-, 2- or 3-tuple of dimensions (e.g. clustersize=(2, 2, 2) for a 3D grid). Use clusterIdx() and clusterDim() to query from within the kernel. Only supported on compute capability 9.0 and above. If clustersize=1, no clusters are launched.
shmem(default: 0): Amount of dynamic shared memory in bytes to allocate per thread block; used by CuDynamicSharedArray.
stream (default: stream()): CuStream to launch the kernel on.
cooperative (default: false): whether to launch a cooperative kernel that supports grid synchronization (see CG.this_grid and CG.sync). Note that this requires care wrt. the number of blocks launched.

Cooperative groups

CUDACore.CG — Module

CUDA.jl's cooperative groups implementation.

Cooperative groups in CUDA offer a structured approach to synchronize and communicate among threads. They allow developers to define specific groups of threads, providing a means to fine-tune inter-thread communication granularity. By offering a more nuanced alternative to traditional CUDA synchronization methods, cooperative groups enable a more controlled and efficient parallel decomposition in kernel design.

The following functionality is available in CUDA.jl:

implicit groups: thread blocks, grid groups, and coalesced groups.
synchronization: sync, barrier_arrive, barrier_wait
warp collectives for coalesced groups: shuffle and voting
data transfer: memcpy_async, wait and wait_prior

Noteworthy missing functionality:

implicit groups: clusters, and multi-grid groups (which are deprecated)
explicit groups: tiling and partitioning

Group construction and properties

CUDACore.CG.thread_rank — Function

thread_rank(group)

Returns the linearized rank of the calling thread along the interval [1, num_threads()].

CUDACore.CG.num_threads — Function

num_threads(group)

Returns the total number of threads in the group.

CUDACore.CG.thread_block — Type

thread_block <: thread_group

Every GPU kernel is executed by a grid of thread blocks, and threads within each block are guaranteed to reside on the same streaming multiprocessor. A thread_block represents a thread block whose dimensions are not known until runtime.

Constructed via this_thread_block

CUDACore.CG.this_thread_block — Function

this_thread_block()

Constructs a thread_block group

CUDACore.CG.group_index — Function

group_index(tb::thread_block)

3-Dimensional index of the block within the launched grid.

CUDACore.CG.thread_index — Function

thread_index(tb::thread_block)

3-Dimensional index of the thread within the launched block.

CUDACore.CG.dim_threads — Function

dim_threads(tb::thread_block)

Dimensions of the launched block in units of threads.

CUDACore.CG.grid_group — Type

grid_group <: thread_group

Threads within this this group are guaranteed to be co-resident on the same device within the same launched kernel. To use this group, the kernel must have been launched with @cuda cooperative=true, and the device must support it (queryable device attribute).

Constructed via this_grid.

CUDACore.CG.this_grid — Function

this_grid()

Constructs a grid_group.

CUDACore.CG.is_valid — Function

is_valid(gg::grid_group)

Returns whether the grid_group can synchronize

CUDACore.CG.block_rank — Function

block_rank(gg::grid_group)

Rank of the calling block within [0, num_blocks)

CUDACore.CG.num_blocks — Function

num_blocks(gg::grid_group)

Total number of blocks in the group.

CUDACore.CG.dim_blocks — Function

dim_blocks(gg::grid_group)

Dimensions of the launched grid in units of blocks.

CUDACore.CG.block_index — Function

block_index(gg::grid_group)

3-Dimensional index of the block within the launched grid.

CUDACore.CG.coalesced_group — Type

coalesced_group <: thread_group

A group representing the current set of converged threads in a warp. The size of the group is not guaranteed and it may return a group of only one thread (itself).

This group exposes warp-synchronous builtins. Constructed via coalesced_threads.

CUDACore.CG.coalesced_threads — Function

coalesced_threads()

Constructs a coalesced_group.

CUDACore.CG.meta_group_rank — Function

meta_group_rank(cg::coalesced_group)

Rank of this group in the upper level of the hierarchy.

CUDACore.CG.meta_group_size — Function

meta_group_size(cg::coalesced_group)

Total number of partitions created out of all CTAs when the group was created.

Synchronization

CUDACore.CG.sync — Function

sync(group)

Synchronize the threads named in the group, equivalent to calling barrier_wait and barrier_arrive in sequence.

CUDACore.CG.barrier_arrive — Function

barrier_arrive(group)

Arrive on the barrier, returns a token that needs to be passed into barrier_wait.

CUDACore.CG.barrier_wait — Function

barrier_wait(group, token)

Wait on the barrier, takes arrival token returned from barrier_arrive.

Data transfer

CUDACore.CG.wait — Function

wait(group)

Make all threads in this group wait for all previously submitted memcpy_async operations to complete.

CUDACore.CG.wait_prior — Function

wait_prior(group, stage)

Make all threads in this group wait for all but stage previously submitted memcpy_async operations to complete.

CUDACore.CG.memcpy_async — Function

memcpy_async(group, dst, src, bytes)

Perform a group-wide collective memory copy from src to dst of bytes bytes. This operation may be performed asynchronously, so you should wait or wait_prior before using the data. It is only supported by thread blocks and coalesced groups.

For this operation to be performed asynchronously, the following conditions must be met:

the source and destination memory should be aligned to 4, 8 or 16 bytes. this will be deduced from the datatype, but can also be specified explicitly using CUDACore.align.
the source should be global memory, and the destination should be shared memory.
the device should have compute capability 8.0 or higher.

Math

Many mathematical functions are provided by the libdevice library, and are wrapped by CUDA.jl. These functions are used to implement well-known functions from the Julia standard library and packages like SpecialFunctions.jl, e.g., calling the cos function will automatically use __nv_cos from libdevice if possible.

Some functions do not have a counterpart in the Julia ecosystem, those have to be called directly. For example, to call __nv_logb or __nv_logbf you use CUDA.logb in a kernel.

For a list of available functions, look at src/device/intrinsics/math.jl.

WMMA

Warp matrix multiply-accumulate (WMMA) is a CUDA API to access Tensor Cores, a new hardware feature in Volta GPUs to perform mixed precision matrix multiply-accumulate operations. The interface is split in two levels, both available in the WMMA submodule: low level wrappers around the LLVM intrinsics, and a higher-level API similar to that of CUDA C.

LLVM Intrinsics

Load matrix

CUDACore.WMMA.llvm_wmma_load — Function

WMMA.llvm_wmma_load_{matrix}_{layout}_{shape}_{addr_space}_stride_{elem_type}(src_addr, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.load.{matrix}.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

src_addr: The memory address to load from.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{matrix}: The matrix to load. Can be a, b or c.
{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Perform multiply-accumulate

CUDACore.WMMA.llvm_wmma_mma — Function

WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{d_elem_type}_{c_elem_type}(a, b, c) or
WMMA.llvm_wmma_mma_{a_layout}_{b_layout}_{shape}_{a_elem_type}(a, b, c)

For floating point operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{d_elem_type}.{c_elem_type} For all other operations: wrapper around the LLVM intrinsic @llvm.nvvm.wmma.mma.sync.{a_layout}.{b_layout}.{shape}.{a_elem_type}

Arguments

a: The WMMA fragment corresponding to the matrix $A$.
b: The WMMA fragment corresponding to the matrix $B$.
c: The WMMA fragment corresponding to the matrix $C$.

Placeholders

{a_layout}: The storage layout for matrix $A$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{b_layout}: The storage layout for matrix $B$. Can be row or col, for row major (C style) or column major (Julia style), respectively. Note that this must match the layout used in the load operation.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{a_elem_type}: The type of each element in the $A$ matrix. Valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point).
{d_elem_type}: The type of each element in the resultant $D$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).
{c_elem_type}: The type of each element in the $C$ matrix. Valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

Warning

Remember that the shape, type and layout of all operations (be it MMA, load or store) MUST match. Otherwise, the behaviour is undefined!

Store matrix

CUDACore.WMMA.llvm_wmma_store — Function

WMMA.llvm_wmma_store_d_{layout}_{shape}_{addr_space}_stride_{elem_type}(dst_addr, data, stride)

Wrapper around the LLVM intrinsic @llvm.nvvm.wmma.store.d.sync.{layout}.{shape}.{addr_space}.stride.{elem_type}.

Arguments

dst_addr: The memory address to store to.
data: The $D$ fragment to store.
stride: The leading dimension of the matrix, in numbers of elements.

Placeholders

{layout}: The storage layout for the matrix. Can be row or col, for row major (C style) or column major (Julia style), respectively.
{shape}: The overall shape of the MAC operation. Valid values are m16n16k16, m32n8k16, and m8n32k16.
{addr_space}: The address space of src_addr. Can be empty (generic addressing), shared or global.
{elem_type}: The type of each element in the matrix. For a and b matrices, valid values are u8 (byte unsigned integer), s8 (byte signed integer), and f16 (half precision floating point). For c and d matrices, valid values are s32 (32-bit signed integer), f16 (half precision floating point), and f32 (full precision floating point).

CUDA C-like API

Fragment

CUDACore.WMMA.RowMajor — Type

WMMA.RowMajor

Type that represents a matrix stored in row major (C style) order.

CUDACore.WMMA.ColMajor — Type

WMMA.ColMajor

Type that represents a matrix stored in column major (Julia style) order.

CUDACore.WMMA.Unspecified — Type

WMMA.Unspecified

Type that represents a matrix stored in an unspecified order.

Warning

This storage format is not valid for all WMMA operations!

CUDACore.WMMA.FragmentLayout — Type

WMMA.FragmentLayout

Abstract type that specifies the storage layout of a matrix.

Possible values are WMMA.RowMajor, WMMA.ColMajor and WMMA.Unspecified.

CUDACore.WMMA.Fragment — Type

WMMA.Fragment

Type that represents per-thread intermediate results of WMMA operations.

You can access individual elements using the x member or [] operator, but beware that the exact ordering of elements is unspecified.

WMMA configuration

CUDACore.WMMA.Config — Type

WMMA.Config{M, N, K, d_type}

Type that contains all information for WMMA operations that cannot be inferred from the argument's types.

WMMA instructions calculate the matrix multiply-accumulate operation $D = A \cdot B + C$, where $A$ is a $M \times K$ matrix, $B$ a $K \times N$ matrix, and $C$ and $D$ are $M \times N$ matrices.

d_type refers to the type of the elements of matrix $D$, and can be either Float16 or Float32.

All WMMA operations take a Config as their final argument.

Examples

julia> config = WMMA.Config{16, 16, 16, Float32}
CUDACore.WMMA.Config{16, 16, 16, Float32}

Load matrix

CUDACore.WMMA.load_a — Function

WMMA.load_a(addr, stride, layout, config)
WMMA.load_b(addr, stride, layout, config)
WMMA.load_c(addr, stride, layout, config)

Load the matrix a, b or c from the memory location indicated by addr, and return the resulting WMMA.Fragment.

Arguments

addr: The address to load the matrix from.
stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
config: The WMMA configuration that should be used for loading this matrix. See WMMA.Config.

See also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config

Warning

All threads in a warp MUST execute the load operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.

WMMA.load_b and WMMA.load_c have the same signature.

Perform multiply-accumulate

CUDACore.WMMA.mma — Function

WMMA.mma(a, b, c, conf)

Perform the matrix multiply-accumulate operation $D = A \cdot B + C$.

Arguments

a: The WMMA.Fragment corresponding to the matrix $A$.
b: The WMMA.Fragment corresponding to the matrix $B$.
c: The WMMA.Fragment corresponding to the matrix $C$.
conf: The WMMA.Config that should be used in this WMMA operation.

Warning

All threads in a warp MUST execute the mma operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.

Store matrix

CUDACore.WMMA.store_d — Function

WMMA.store_d(addr, d, stride, layout, config)

Store the result matrix d to the memory location indicated by addr.

Arguments

addr: The address to store the matrix to.
d: The WMMA.Fragment corresponding to the d matrix.
stride: The leading dimension of the matrix pointed to by addr, specified in number of elements.
layout: The storage layout of the matrix. Possible values are WMMA.RowMajor and WMMA.ColMajor.
config: The WMMA configuration that should be used for storing this matrix. See WMMA.Config.

See also: WMMA.Fragment, WMMA.FragmentLayout, WMMA.Config

Warning

All threads in a warp MUST execute the store operation in lockstep, and have to use exactly the same arguments. Failure to do so will result in undefined behaviour.

Fill fragment

CUDACore.WMMA.fill_c — Function

WMMA.fill_c(value, config)

Return a WMMA.Fragment filled with the value value.

This operation is useful if you want to implement a matrix multiplication (and thus want to set $C = O$).

Arguments

value: The value used to fill the fragment. Can be a Float16 or Float32.
config: The WMMA configuration that should be used for this WMMA operation. See WMMA.Config.

Other

CUDACore.align — Type

CUDA.align{N}(obj)

Construct an aligned object, providing alignment information to APIs that require it.