# Array programming

The easiest way to use the GPU's massive parallelism, is by expressing operations in terms of arrays: CUDA.jl provides an array type, `CuArray`

, and many specialized array operations that execute efficiently on the GPU hardware. In this section, we will briefly demonstrate use of the `CuArray`

type. Since we expose CUDA's functionality by implementing existing Julia interfaces on the `CuArray`

type, you should refer to the upstream Julia documentation for more information on these operations.

If you encounter missing functionality, or are running into operations that trigger so-called "scalar iteration", have a look at the issue tracker and file a new issue if there's none. Do note that you can always access the underlying CUDA APIs by calling into the relevant submodule. For example, if parts of the Random interface isn't properly implemented by CUDA.jl, you can look at the CURAND documentation and possibly call methods from the `CURAND`

submodule directly. These submodules are available after importing the CUDA package.

## Construction and Initialization

The `CuArray`

type aims to implement the `AbstractArray`

interface, and provide implementations of methods that are commonly used when working with arrays. That means you can construct `CuArray`

s in the same way as regular `Array`

objects:

```
julia> CuArray{Int}(undef, 2)
2-element CuArray{Int64, 1}:
0
0
julia> CuArray{Int}(undef, (1,2))
1×2 CuArray{Int64, 2}:
0 0
julia> similar(ans)
1×2 CuArray{Int64, 2}:
0 0
```

Copying memory to or from the GPU can be expressed using constructors as well, or by calling `copyto!`

:

```
julia> a = CuArray([1,2])
2-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
1
2
julia> b = Array(a)
2-element Vector{Int64}:
1
2
julia> copyto!(b, a)
2-element Vector{Int64}:
1
2
```

## Higher-order abstractions

The real power of programming GPUs with arrays comes from Julia's higher-order array abstractions: Operations that take user code as an argument, and specialize execution on it. With these functions, you can often avoid having to write custom kernels. For example, to perform simple element-wise operations you can use `map`

or `broadcast`

:

```
julia> a = CuArray{Float32}(undef, (1,2));
julia> a .= 5
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
5.0 5.0
julia> map(sin, a)
1×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
-0.958924 -0.958924
```

To reduce the dimensionality of arrays, CUDA.jl implements the various flavours of `(map)reduce(dim)`

:

```
julia> a = CUDA.ones(2,3)
2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0 1.0
1.0 1.0 1.0
julia> reduce(+, a)
6.0f0
julia> mapreduce(sin, *, a; dims=2)
2×1 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.59582335
0.59582335
julia> b = CUDA.zeros(1)
1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.0
julia> Base.mapreducedim!(identity, +, b, a)
1×1 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
6.0
```

To retain intermediate values, you can use `accumulate`

:

```
julia> a = CUDA.ones(2,3)
2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0 1.0
1.0 1.0 1.0
julia> accumulate(+, a; dims=2)
2×3 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
1.0 2.0 3.0
1.0 2.0 3.0
```

Be wary that the operator `f`

of `accumulate`

, `accumulate!`

, `scan`

and `scan!`

must be associative since the operation is performed in parallel. That is `f(f(a,b)c)`

must be equivalent to `f(a,f(b,c))`

. Accumulating with a non-associative operator on a `CuArray`

will not produce the same result as on an `Array`

.

## Logical operations

`CuArray`

s can also be indexed with arrays of boolean values to select items:

```
julia> a = CuArray([1,2,3])
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
1
2
3
julia> a[[false,true,false]]
1-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
2
```

Built on top of this, are several functions with higher-level semantics:

```
julia> a = CuArray([11,12,13])
3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
11
12
13
julia> findall(isodd, a)
2-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
1
3
julia> findfirst(isodd, a)
1
julia> b = CuArray([11 12 13; 21 22 23])
2×3 CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}:
11 12 13
21 22 23
julia> findmin(b)
(11, CartesianIndex(1, 1))
julia> findmax(b; dims=2)
([13; 23], CartesianIndex{2}[CartesianIndex(1, 3); CartesianIndex(2, 3)])
```

## Array wrappers

To some extent, CUDA.jl also supports well-known array wrappers from the standard library:

```
julia> a = CuArray(collect(1:10))
10-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
1
2
3
4
5
6
7
8
9
10
julia> a = CuArray(collect(1:6))
6-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
1
2
3
4
5
6
julia> b = reshape(a, (2,3))
2×3 CuArray{Int64, 2, CUDA.Mem.DeviceBuffer}:
1 3 5
2 4 6
julia> c = view(a, 2:5)
4-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
2
3
4
5
```

The above contiguous `view`

and `reshape`

have been specialized to return new objects of type `CuArray`

. Other wrappers, such as non-contiguous views or the LinearAlgebra wrappers that will be discussed below, are implemented using their own type (e.g. `SubArray`

or `Transpose`

). This can cause problems, as calling methods with these wrapped objects will not dispatch to specialized `CuArray`

methods anymore. That may result in a call to fallback functionality that performs scalar iteration.

Certain common operations, like broadcast or matrix multiplication, do know how to deal with array wrappers by using the Adapt.jl package. This is still not a complete solution though, e.g. new array wrappers are not covered, and only one level of wrapping is supported. Sometimes the only solution is to materialize the wrapper to a `CuArray`

again.

## Random numbers

Base's convenience functions for generating random numbers are available in the CUDA module as well:

```
julia> CUDA.rand(2)
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.74021935
0.9209938
julia> CUDA.randn(Float64, 2, 1)
2×1 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
-0.3893830994647195
1.618410515635752
```

Behind the scenes, these random numbers come from two different generators: one backed by CURAND, another by kernels defined in CUDA.jl. Operations on these generators are implemented using methods from the Random standard library:

```
julia> using Random
julia> a = Random.rand(CURAND.default_rng(), Float32, 1)
1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.74021935
julia> a = Random.rand!(CUDA.default_rng(), a)
1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.46691537
```

CURAND also supports generating lognormal and Poisson-distributed numbers:

```
julia> CUDA.rand_logn(Float32, 1, 5; mean=2, stddev=20)
1×5 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
2567.61 4.256f-6 54.5948 0.00283999 9.81175f22
julia> CUDA.rand_poisson(UInt32, 1, 10; lambda=100)
1×10 CuArray{UInt32, 2, CUDA.Mem.DeviceBuffer}:
0x00000058 0x00000066 0x00000061 … 0x0000006b 0x0000005f 0x00000069
```

Note that these custom operations are only supported on a subset of types.

## Linear algebra

CUDA's linear algebra functionality from the CUBLAS library is exposed by implementing methods in the LinearAlgebra standard library:

```
julia> # enable logging to demonstrate a CUBLAS kernel is used
CUBLAS.cublasLoggerConfigure(1, 0, 1, C_NULL)
julia> CUDA.rand(2,2) * CUDA.rand(2,2)
I! cuBLAS (v10.2) function cublasStatus_t cublasSgemm_v2(cublasContext*, cublasOperation_t, cublasOperation_t, int, int, int, const float*, const float*, int, const float*, int, const float*, float*, int) called
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.295727 0.479395
0.624576 0.557361
```

Certain operations, like the above matrix-matrix multiplication, also have a native fallback written in Julia for the purpose of working with types that are not supported by CUBLAS:

```
julia> # enable logging to demonstrate no CUBLAS kernel is used
CUBLAS.cublasLoggerConfigure(1, 0, 1, C_NULL)
julia> CUDA.rand(Int128, 2, 2) * CUDA.rand(Int128, 2, 2)
2×2 CuArray{Int128, 2, CUDA.Mem.DeviceBuffer}:
-147256259324085278916026657445395486093 -62954140705285875940311066889684981211
-154405209690443624360811355271386638733 -77891631198498491666867579047988353207
```

Operations that exist in CUBLAS, but are not (yet) covered by high-level constructs in the LinearAlgebra standard library, can be accessed directly from the CUBLAS submodule. Note that you do not need to call the C wrappers directly (e.g. `cublasDdot`

), as many operations have more high-level wrappers available as well (e.g. `dot`

):

```
julia> x = CUDA.rand(2)
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.74021935
0.9209938
julia> y = CUDA.rand(2)
2-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
0.03902049
0.9689629
julia> CUBLAS.dot(2, x, y)
0.92129254f0
julia> using LinearAlgebra
julia> dot(Array(x), Array(y))
0.92129254f0
```

## Solver

LAPACK-like functionality as found in the CUSOLVER library can be accessed through methods in the LinearAlgebra standard library too:

```
julia> using LinearAlgebra
julia> a = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.740219 0.0390205
0.920994 0.968963
julia> a = a * a'
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.549447 0.719547
0.719547 1.78712
julia> cholesky(a)
Cholesky{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}
U factor:
2×2 UpperTriangular{Float32, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}}:
0.741247 0.970725
⋅ 0.919137
```

Other operations are bound to the left-division operator:

```
julia> a = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.740219 0.0390205
0.920994 0.968963
julia> b = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.925141 0.667319
0.44635 0.109931
julia> a \ b
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
1.29018 0.942772
-0.765663 -0.782648
julia> Array(a) \ Array(b)
2×2 Matrix{Float32}:
1.29018 0.942773
-0.765663 -0.782648
```

## Sparse arrays

Sparse array functionality from the CUSPARSE library is mainly available through functionality from the SparseArrays package applied to `CuSparseArray`

objects:

```
julia> using SparseArrays
julia> x = sprand(10,0.2)
10-element SparseVector{Float64, Int64} with 4 stored entries:
[3 ] = 0.585812
[4 ] = 0.539289
[7 ] = 0.260036
[8 ] = 0.910047
julia> using CUDA.CUSPARSE
julia> d_x = CuSparseVector(x)
10-element CuSparseVector{Float64, Int32} with 4 stored entries:
[3 ] = 0.585812
[4 ] = 0.539289
[7 ] = 0.260036
[8 ] = 0.910047
julia> nonzeros(d_x)
4-element CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}:
0.5858115517433242
0.5392892841426182
0.26003585026904785
0.910046541351011
julia> nnz(d_x)
4
```

For 2-D arrays the `CuSparseMatrixCSC`

and `CuSparseMatrixCSR`

can be used.

Non-integrated functionality can be access directly in the CUSPARSE submodule again.

## FFTs

Functionality from CUFFT is integrated with the interfaces from the AbstractFFTs.jl package:

```
julia> a = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
0.740219 0.0390205
0.920994 0.968963
julia> using CUDA.CUFFT
julia> fft(a)
2×2 CuArray{ComplexF32, 2, CUDA.Mem.DeviceBuffer}:
2.6692+0.0im 0.65323+0.0im
-1.11072+0.0im 0.749168+0.0im
```