Multiple GPUs

There are different ways of working with multiple GPUs: using one or more tasks, processes, or systems. Although all of these are compatible with the Julia CUDA toolchain, the support is a work in progress and the usability of some combinations can be significantly improved.

Scenario 1: One GPU per process

The easiest solution that maps well onto Julia's existing facilities for distributed programming, is to use one GPU per process

# spawn one worker per device
using Distributed, CUDA
@everywhere using CUDA

# assign devices
asyncmap((zip(workers(), devices()))) do (p, d)
    remotecall_wait(p) do
        @info "Worker $p uses $d"

Communication between nodes should happen via the CPU (the CUDA IPC APIs are available as CUDA.cuIpcOpenMemHandle and friends, but not available through high-level wrappers).

Alternatively, one can use MPI.jl together with an CUDA-aware MPI implementation. In that case, CuArray objects can be passed as send and receive buffers to point-to-point and collective operations to avoid going through the CPU.

Scenario 2: Multiple GPUs per process

In a similar vein to the multi-process solution, one can work with multiple devices from within a single process by calling CUDA.device! to switch to a specific device. Furthermore, as the active device is a task-local property you can easily work with multiple devices using one task per device. For more details, refer to the section on Tasks and threads.


You currently need to re-set the device at the start of every task, i.e., call device! as one of the first statement in your @async or @spawn block:

@sync begin
    @async begin
        # do work on GPU 0 here
    @async begin
        # do work on GPU 1 here

Without this, the newly-created task would use the same device as the previously-executing task, and not the parent task as could be expected. This is expected to be improved in the future using context variables.

Memory management

When working with multiple devices, you need to be careful with allocated memory: Allocations are tied to the device that was active when requesting the memory, and cannot be used with another device. That means you cannot allocate a CuArray, switch devices, and use that object. Similar restrictions apply to library objects, like CUFFT plans.

To avoid this difficulty, you can use unified memory that is accessible from all devices:

using CUDA

gpus = Int(length(devices()))

# generate CPU data
dims = (3,4,gpus)
a = round.(rand(Float32, dims) * 100)
b = round.(rand(Float32, dims) * 100)

# allocate and initialize GPU data
d_a = cu(a; unified=true)
d_b = cu(b; unified=true)
d_c = similar(d_a)

The data allocated here uses the GPU id as a the outermost dimension, which can be used to extract views of contiguous memory that represent the slice to be processed by a single GPU:

for (gpu, dev) in enumerate(devices())
    @views d_c[:, :, gpu] .= d_a[:, :, gpu] .+ d_b[:, :, gpu]

Before downloading the data, make sure to synchronize the devices:

for dev in devices()
    # NOTE: normally you'd use events and wait for them

using Test
c = Array(d_c)
@test a+b ≈ c