Array programming
The easiest way to use the GPU's massive parallelism is by expressing operations in terms of arrays: CUDA.jl provides an array type, CuArray, and many specialized array operations that execute efficiently on the GPU hardware. In this section, we will briefly demonstrate use of the CuArray type. Since we expose CUDA's functionality by implementing existing Julia interfaces on the CuArray type, you should refer to the upstream Julia documentation for more information on these operations.
If you encounter missing functionality, or run into operations that trigger so-called "scalar iteration", have a look at the issue tracker and file a new issue if there isn't one already. Do note that you can always access the underlying CUDA APIs by calling into the relevant submodule. For example, if parts of the Random interface aren't properly implemented by CUDA.jl, you can look at the CURAND documentation and possibly call methods from the CURAND submodule directly. These submodules are available after importing the CUDA package.
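If you want to catch scalar iteration early, it can be disallowed explicitly. Below is a minimal sketch using CUDA.allowscalar and the CUDA.@allowscalar macro:
julia> CUDA.allowscalar(false);  # make scalar indexing of GPU arrays an error
julia> a = CuArray([1, 2, 3]);
julia> CUDA.@allowscalar a[1]  # explicitly permit it for a single expression
1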
Construction and Initialization
The CuArray type aims to implement the AbstractArray interface, and provide implementations of methods that are commonly used when working with arrays. That means you can construct CuArrays in the same way as regular Array objects:
julia> CuArray{Int}(undef, 2)
2-element CuArray{Int64, 1}:
0
0
julia> CuArray{Int}(undef, (1,2))
1×2 CuArray{Int64, 2}:
0 0
julia> similar(ans)
1×2 CuArray{Int64, 2}:
0 0
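Arrays with initialized contents can be created with the CUDA.zeros and CUDA.ones helpers that are also used further below; for example:
julia> CUDA.zeros(2)
2-element CuArray{Float32, 1, CUDA.DeviceMemory}:
0.0
0.0
julia> CUDA.ones(1, 2)
1×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
1.0 1.0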
Copying memory to or from the GPU can be expressed using constructors as well, or by calling copyto!:
julia> a = CuArray([1,2])
2-element CuArray{Int64, 1, CUDA.DeviceMemory}:
1
2
julia> b = Array(a)
2-element Vector{Int64}:
1
2
julia> copyto!(b, a)
2-element Vector{Int64}:
1
2
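The exported cu function is another convenient way to adapt CPU data to the GPU; do note that it converts Float64 element types to Float32:
julia> cu([1.0, 2.0])
2-element CuArray{Float32, 1, CUDA.DeviceMemory}:
1.0
2.0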
Higher-order abstractions
The real power of programming GPUs with arrays comes from Julia's higher-order array abstractions: operations that take user code as an argument and specialize execution on it. With these functions, you can often avoid having to write custom kernels. For example, to perform simple element-wise operations you can use map or broadcast:
julia> a = CuArray{Float32}(undef, (1,2));
julia> a .= 5
1×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
5.0 5.0
julia> map(sin, a)
1×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
-0.958924 -0.958924
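Broadcast expressions fuse, so a chain of dot-operations such as the one below typically executes as a single GPU kernel (continuing with the a assigned above):
julia> 2 .* a .+ 1
1×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
11.0 11.0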
To reduce the dimensionality of arrays, CUDA.jl implements the various flavours of (map)reduce(dim):
julia> a = CUDA.ones(2,3)
2×3 CuArray{Float32, 2, CUDA.DeviceMemory}:
1.0 1.0 1.0
1.0 1.0 1.0
julia> reduce(+, a)
6.0f0
julia> mapreduce(sin, *, a; dims=2)
2×1 CuArray{Float32, 2, CUDA.DeviceMemory}:
0.59582335
0.59582335
julia> b = CUDA.zeros(1)
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
0.0
julia> Base.mapreducedim!(identity, +, b, a)
1×1 CuArray{Float32, 2, CUDA.DeviceMemory}:
6.0
To retain intermediate values, you can use accumulate:
julia> a = CUDA.ones(2,3)
2×3 CuArray{Float32, 2, CUDA.DeviceMemory}:
1.0 1.0 1.0
1.0 1.0 1.0
julia> accumulate(+, a; dims=2)
2×3 CuArray{Float32, 2, CUDA.DeviceMemory}:
1.0 2.0 3.0
1.0 2.0 3.0
Be wary that the operator f of accumulate, accumulate!, scan and scan! must be associative since the operation is performed in parallel. That is, f(f(a,b),c) must be equivalent to f(a,f(b,c)). Accumulating with a non-associative operator on a CuArray will not produce the same result as on an Array.
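For example, the following sketch compares a scan with the non-associative subtraction operator on the GPU and on the CPU; the two results may differ (outputs omitted):
julia> a = CuArray(Float32[1, 2, 3, 4]);
julia> accumulate(-, a);  # parallel scan, evaluation order differs
julia> accumulate(-, Array(a));  # sequential scan on the CPU, for comparison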
Logical operations
CuArrays can also be indexed with arrays of boolean values to select items:
julia> a = CuArray([1,2,3])
3-element CuArray{Int64, 1, CUDA.DeviceMemory}:
1
2
3
julia> a[[false,true,false]]
1-element CuArray{Int64, 1, CUDA.DeviceMemory}:
2
Built on top of this are several functions with higher-level semantics:
julia> a = CuArray([11,12,13])
3-element CuArray{Int64, 1, CUDA.DeviceMemory}:
11
12
13
julia> findall(isodd, a)
2-element CuArray{Int64, 1, CUDA.DeviceMemory}:
1
3
julia> findfirst(isodd, a)
1
julia> b = CuArray([11 12 13; 21 22 23])
2×3 CuArray{Int64, 2, CUDA.DeviceMemory}:
11 12 13
21 22 23
julia> findmin(b)
(11, CartesianIndex(1, 1))
julia> findmax(b; dims=2)
([13; 23;;], CartesianIndex{2}[CartesianIndex(1, 3); CartesianIndex(2, 3);;])
Array wrappers
To some extent, CUDA.jl also supports well-known array wrappers from the standard library:
julia> a = CuArray(collect(1:10))
10-element CuArray{Int64, 1, CUDA.DeviceMemory}:
1
2
3
4
5
6
7
8
9
10
julia> a = CuArray(collect(1:6))
6-element CuArray{Int64, 1, CUDA.DeviceMemory}:
1
2
3
4
5
6
julia> b = reshape(a, (2,3))
2×3 CuArray{Int64, 2, CUDA.DeviceMemory}:
1 3 5
2 4 6
julia> c = view(a, 2:5)
4-element CuArray{Int64, 1, CUDA.DeviceMemory}:
2
3
4
5
The above contiguous view and reshape have been specialized to return new objects of type CuArray. Other wrappers, such as non-contiguous views or the LinearAlgebra wrappers that will be discussed below, are implemented using their own type (e.g. SubArray or Transpose). This can cause problems, as calling methods with these wrapped objects will no longer dispatch to specialized CuArray methods. That may result in a call to fallback functionality that performs scalar iteration.
Certain common operations, like broadcast or matrix multiplication, do know how to deal with array wrappers by using the Adapt.jl package. This is still not a complete solution, though: new array wrappers are not covered, and only one level of wrapping is supported. Sometimes the only solution is to materialize the wrapper to a CuArray again.
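For example, a Transpose wrapper can be materialized by broadcasting it into a fresh CuArray, relying on the Adapt-based broadcast support mentioned above (a minimal sketch):
julia> using LinearAlgebra
julia> a = CuArray([1 2; 3 4]);
julia> at = transpose(a);  # lazily wrapped as a Transpose
julia> b = similar(a);
julia> b .= at  # materialize into a plain CuArray again
2×2 CuArray{Int64, 2, CUDA.DeviceMemory}:
1 3
2 4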
Random numbers
Base's convenience functions for generating random numbers are available in the CUDA module as well:
julia> CUDA.rand(2)
2-element CuArray{Float32, 1, CUDA.DeviceMemory}:
0.74021935
0.9209938
julia> CUDA.randn(Float64, 2, 1)
2×1 CuArray{Float64, 2, CUDA.DeviceMemory}:
-0.3893830994647195
1.618410515635752
Behind the scenes, these random numbers come from two different generators: one backed by CURAND, another by kernels defined in CUDA.jl. Operations on these generators are implemented using methods from the Random standard library:
julia> using Random
julia> a = Random.rand(CURAND.default_rng(), Float32, 1)
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
0.74021935
julia> a = Random.rand!(CUDA.default_rng(), a)
1-element CuArray{Float32, 1, CUDA.DeviceMemory}:
0.46691537
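These generators can also be seeded for reproducibility, either through Random.seed! or with the CUDA.seed! convenience function (a minimal sketch, outputs omitted):
julia> Random.seed!(CUDA.default_rng(), 1);
julia> CUDA.seed!(1);  # seeds the default GPU generators
julia> CUDA.rand(2);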
CURAND also supports generating lognormal and Poisson-distributed numbers:
julia> CUDA.rand_logn(Float32, 1, 5; mean=2, stddev=20)
1×5 CuArray{Float32, 2, CUDA.DeviceMemory}:
2567.61 4.256f-6 54.5948 0.00283999 9.81175f22
julia> CUDA.rand_poisson(UInt32, 1, 10; lambda=100)
1×10 CuArray{UInt32, 2, CUDA.DeviceMemory}:
0x00000058 0x00000066 0x00000061 … 0x0000006b 0x0000005f 0x00000069
Note that these custom operations are only supported on a subset of types.
Linear algebra
CUDA's linear algebra functionality from the CUBLAS library is exposed by implementing methods in the LinearAlgebra standard library:
julia> # enable logging to demonstrate a CUBLAS kernel is used
CUBLAS.cublasLoggerConfigure(1, 0, 1, C_NULL)
julia> CUDA.rand(2,2) * CUDA.rand(2,2)
I! cuBLAS (v10.2) function cublasStatus_t cublasSgemm_v2(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, const float*, const float*, int, const float*, int, const float*, float*, int) called
2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
0.295727 0.479395
0.624576 0.557361
Certain operations, like the above matrix-matrix multiplication, also have a native fallback written in Julia for the purpose of working with types that are not supported by CUBLAS:
julia> # enable logging to demonstrate no CUBLAS kernel is used
CUBLAS.cublasLoggerConfigure(1, 0, 1, C_NULL)
julia> CUDA.rand(Int128, 2, 2) * CUDA.rand(Int128, 2, 2)
2×2 CuArray{Int128, 2, CUDA.DeviceMemory}:
-147256259324085278916026657445395486093 -62954140705285875940311066889684981211
-154405209690443624360811355271386638733 -77891631198498491666867579047988353207
Operations that exist in CUBLAS, but are not (yet) covered by high-level constructs in the LinearAlgebra standard library, can be accessed directly from the CUBLAS submodule. Note that you do not need to call the C wrappers directly (e.g. cublasDdot), as many operations have more high-level wrappers available as well (e.g. dot):
julia> x = CUDA.rand(2)
2-element CuArray{Float32, 1, CUDA.DeviceMemory}:
0.74021935
0.9209938
julia> y = CUDA.rand(2)
2-element CuArray{Float32, 1, CUDA.DeviceMemory}:
0.03902049
0.9689629
julia> CUBLAS.dot(2, x, y)
0.92129254f0
julia> using LinearAlgebra
julia> dot(Array(x), Array(y))
0.92129254f0
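Calling the high-level dot directly on the GPU arrays works as well, and dispatches to CUBLAS:
julia> dot(x, y)
0.92129254f0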
Solver
LAPACK-like functionality as found in the CUSOLVER library can be accessed through methods in the LinearAlgebra standard library too:
julia> using LinearAlgebra
julia> a = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
0.740219 0.0390205
0.920994 0.968963
julia> a = a * a'
2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
0.549447 0.719547
0.719547 1.78712
julia> cholesky(a)
Cholesky{Float32, CuArray{Float32, 2, CUDA.DeviceMemory}}
U factor:
2×2 UpperTriangular{Float32, CuArray{Float32, 2, CUDA.DeviceMemory}}:
0.741247 0.970725
⋅ 0.919137
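Further factorizations, such as qr and svd, can be used in the same way (a minimal sketch, outputs omitted):
julia> qr(a);
julia> svd(a);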
Other operations are bound to the left-division operator:
julia> a = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
0.740219 0.0390205
0.920994 0.968963
julia> b = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
0.925141 0.667319
0.44635 0.109931
julia> a \ b
2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
1.29018 0.942773
-0.765663 -0.782648
julia> Array(a) \ Array(b)
2×2 Matrix{Float32}:
1.29018 0.942773
-0.765663 -0.782648
Sparse arrays
Sparse array functionality from the CUSPARSE library is mainly available through functionality from the SparseArrays package applied to CuSparseArray objects:
julia> using SparseArrays
julia> x = sprand(10,0.2)
10-element SparseVector{Float64, Int64} with 5 stored entries:
[2 ] = 0.538639
[4 ] = 0.89699
[6 ] = 0.258478
[7 ] = 0.338949
[10] = 0.424742
julia> using CUDA.CUSPARSE
julia> d_x = CuSparseVector(x)
10-element CuSparseVector{Float64, Int32} with 5 stored entries:
[2 ] = 0.538639
[4 ] = 0.89699
[6 ] = 0.258478
[7 ] = 0.338949
[10] = 0.424742
julia> nonzeros(d_x)
5-element CuArray{Float64, 1, CUDA.DeviceMemory}:
0.538639413965653
0.8969897902567084
0.25847781536337067
0.3389490517221738
0.4247416640213063
julia> nnz(d_x)
5
For 2-D arrays, the CuSparseMatrixCSC and CuSparseMatrixCSR types can be used.
Non-integrated functionality can be accessed directly from the CUSPARSE submodule.
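For example, a sparse CPU matrix can be converted to CSR storage on the GPU and multiplied with a dense vector, which dispatches to CUSPARSE (a minimal sketch, outputs omitted):
julia> A = sprand(10, 10, 0.2);
julia> d_A = CuSparseMatrixCSR(A);  # compressed sparse row storage on the GPU
julia> d_x = CuArray(rand(10));
julia> d_y = d_A * d_x;  # sparse matrix-vector product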
FFTs
Functionality from CUFFT is integrated with the interfaces from the AbstractFFTs.jl package:
julia> a = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.DeviceMemory}:
0.740219 0.0390205
0.920994 0.968963
julia> using CUDA.CUFFT
julia> fft(a)
2×2 CuArray{ComplexF32, 2, CUDA.DeviceMemory}:
2.6692+0.0im 0.65323+0.0im
-1.11072+0.0im 0.749168+0.0im
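Inverse transforms and precomputed plans from the AbstractFFTs interface are available too (a minimal sketch, outputs omitted):
julia> b = fft(a);
julia> ifft(b);  # inverse transform
julia> p = plan_fft(b);  # precompute a plan for repeated transforms of same-shaped data
julia> p * b;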