GPUnet: networking abstractions for GPU programs

A more technical presentation of GPUnet is available from NVIDIA.

GPUnet is a native GPU networking layer that provides a reliable stream abstraction over InfiniBand and high-level socket APIs to GPU programs on NVIDIA GPUs.

GPUnet enables threads or threadblocks on one GPU to communicate with threads on other GPUs or CPUs through the standard and familiar socket interface, whether they run on the same machine or on different machines.

GPUnet uses peer-to-peer DMA (via GPUDirect RDMA) to place and manage the network buffers of a GPU application directly in GPU memory.

Code example

This is a code example of a simple (working) GPU echo client.
Note that the GPU socket API is threadblock-cooperative: all threads in a threadblock must call the same function with the same parameters at the same point in the program.

__global__ void gpuclient(struct sockaddr_in *addr, int* tb_alloc_tbl, int nr_tb) {
        __shared__ int sock;
        __shared__ uchar buf[BUF_SIZE];
        int ret, i;

        /* Connect to the server, retrying until a socket is available.
           All threads in the threadblock make the call together. */
        while ((sock = gconnect_in(addr)) < 0) {};
        assert(sock >= 0);

        for (i = 0; i < NR_MSG; i++) {
                int recved = 0, sent = 0;

                /* Send a full buffer, handling partial sends. */
                do {
                        ret = gsend(sock, buf + sent, BUF_SIZE - sent);
                        if (ret < 0) {
                                goto out;
                        } else {
                                sent += ret;
                        }
                } while (sent < BUF_SIZE);

                __syncthreads();

                /* Receive the echoed buffer back, handling partial receives. */
                do {
                        ret = grecv(sock, buf + recved, BUF_SIZE - recved);
                        if (ret < 0) {
                                goto out;
                        } else {
                                recved += ret;
                        }
                } while (recved < BUF_SIZE);

                __syncthreads();

        }

out:
        /* Close the socket from a single thread of the threadblock. */
        BEGIN_SINGLE_THREAD_PART {
                single_thread_gclose(sock);
        } END_SINGLE_THREAD_PART;
}


GPUnet layers

GPUnet allows GPU programs to communicate directly from the GPU, cutting CPU code development out of the loop; this is the key to its programming simplicity. The top layer of the GPUnet infrastructure provides a unified networking API with a reliable streaming abstraction. GPUnet implements blocking and non-blocking versions of commonly used API functions, such as send and recv.
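To illustrate the difference between the two flavors, the sketch below contrasts a blocking receive loop (using grecv from the echo client above) with a polling loop over a non-blocking variant. The name grecv_nb and its convention of returning 0 when no data is available are assumptions made purely for illustration, not necessarily the actual GPUnet API.

/* Sketch: receiving exactly `len` bytes with a blocking call vs. a polling
   loop. grecv() appears in the echo client above; grecv_nb() and its
   "return 0 when no data is ready" convention are assumed here. */
__device__ int recv_full_blocking(int sock, uchar *buf, int len)
{
        int recved = 0;
        while (recved < len) {
                int ret = grecv(sock, buf + recved, len - recved); /* blocks */
                if (ret < 0)
                        return ret;
                recved += ret;
        }
        return recved;
}

__device__ int recv_full_polling(int sock, uchar *buf, int len)
{
        int recved = 0;
        while (recved < len) {
                int ret = grecv_nb(sock, buf + recved, len - recved);
                if (ret < 0)
                        return ret;
                if (ret == 0)
                        continue;  /* no data yet: a real kernel could do
                                      other useful work here instead */
                recved += ret;
        }
        return recved;
}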

The bottom layer exposes a reliable channel abstraction to the upper layers. A channel enables fixed-size messages to be sent reliably between two endpoints. To achieve high performance, GPUnet uses the advanced capabilities of InfiniBand HCAs, which implement RDMA functionality in hardware. Moreover, if the chipset and the GPU support peer-to-peer DMA over PCIe, GPUnet uses it to store network buffers directly in GPU memory, bypassing the CPU. CPU programs and other GPUs can use the same HCA at the same time.

However, if no RDMA is available, as is the case for UNIX domain sockets or TCP, GPUnet enlists CPU assistance for transport-layer processing and for moving data in and out of GPU memory. The top-layer interface, however, does not change. Note that this configuration does not guarantee high performance, since performance depends on the capabilities of the underlying hardware.

The intermediate layer of GPUnet implements a reliable stream abstraction on top of reliable channels, which requires GPUnet to implement stream flow control.
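As a rough illustration of how a byte stream can be layered over fixed-size messages, the sketch below shows credit-based flow control: the sender spends one credit per message and replenishes credits as the receiver reports drained buffer space. The names (MSG_SIZE, chan_send, chan_poll_credits) and the exact accounting are assumptions for this sketch, not the GPUnet implementation.

/* Hypothetical credit-based flow control over a fixed-size message channel.
   chan_send() and chan_poll_credits() are illustrative names only. */
#define MSG_SIZE 4096

struct stream_tx {
        int credits;   /* messages the receiver can still buffer */
};

/* Send `len` bytes as a sequence of fixed-size messages, waiting whenever
   the receiver has no buffer space left. */
static int stream_send(struct stream_tx *tx, int chan,
                       const char *buf, int len)
{
        int sent = 0;
        while (sent < len) {
                while (tx->credits == 0)
                        tx->credits += chan_poll_credits(chan);
                int chunk = (len - sent < MSG_SIZE) ? len - sent : MSG_SIZE;
                chan_send(chan, buf + sent, chunk);   /* reliable channel */
                tx->credits--;
                sent += chunk;
        }
        return sent;
}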


Under the hood

GPUnet uses a CPU helper for HCA management and connection establishment. This allows GPUnet to share the network namespace with CPU programs on the same machine: for example, ports in use by CPU programs cannot be used by GPUnet, and vice versa.

Once the connection is established, however, no CPU involvement should be necessary. In the current GPUnet implementation, data transfers indeed require no CPU involvement, but control is still performed by a CPU helper. That is because, with current hardware, a GPU is still unable to fully control a NIC. Specifically, the NVIDIA GPUs and NVIDIA CUDA we use to implement GPUnet do not allow mapping hardware MMIO regions into the GPU memory space, which is necessary to access the NIC doorbell registers and trigger RDMA transfers.

We therefore employ a ring buffer for NIC-to-GPU and GPU-to-NIC communication. The ring buffer implements a producer-consumer pattern over PCIe, allowing the GPU to post send requests and letting the CPU handle them asynchronously.
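A minimal sketch of such a request ring, assuming a single GPU producer thread and a memory mapping visible to both sides, is shown below. The structure layout, field names and post_rdma_write() are illustrative assumptions; the real GPUnet ring also has to handle PCIe write ordering and request batching.

/* Hypothetical GPU-to-CPU send-request ring in memory accessible to both
   the GPU and the CPU helper. Layout and names are illustrative only. */
#define RING_SLOTS 64

struct send_req {
        int    sock;     /* GPU socket the request belongs to */
        size_t offset;   /* offset of the data in the GPU send buffer */
        size_t len;      /* number of bytes to transmit */
};

struct req_ring {
        volatile unsigned head;   /* written by the GPU (producer) */
        volatile unsigned tail;   /* written by the CPU helper (consumer) */
        struct send_req slots[RING_SLOTS];
};

void post_rdma_write(int sock, size_t offset, size_t len);  /* hypothetical */

/* GPU side: post one send request; assumes a single producer thread
   (e.g., called inside a single-thread section). */
__device__ void ring_post(struct req_ring *r, struct send_req req)
{
        unsigned h = r->head;
        while (h - r->tail == RING_SLOTS)
                ;                          /* ring full: wait for the CPU */
        r->slots[h % RING_SLOTS] = req;
        __threadfence_system();            /* make the slot visible over PCIe */
        r->head = h + 1;                   /* publish the request */
}

/* CPU helper side: drain pending requests and hand them to the NIC. */
void ring_drain(struct req_ring *r)
{
        while (r->tail != r->head) {
                struct send_req req = r->slots[r->tail % RING_SLOTS];
                post_rdma_write(req.sock, req.offset, req.len);
                r->tail++;
        }
}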

System components

The GPU application is linked with the GPUnet library, which implements flow control for network buffers in GPU memory. On the CPU side, the GPUnet proxy runs in a separate thread of the application that invokes and manages the GPU program, and it manages the NIC on behalf of the GPU. For hardware without GPUDirect RDMA support, GPUnet implements bounce buffers in CPU memory.
Applications
We developed GimMR, a fully functional MapReduce system written entirely in GPU code. It uses GPUfs to read its inputs from and write its outputs to disk. The key point of our implementation is that during the shuffle stage the data is sent directly to a peer GPU, making GimMR an “In-GPU-Memory” distributed application. GPUnet enables us to scale the single-node GPU application out to several GPUs located in different machines.
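To give a feel for the scale-out path, the sketch below shows how a shuffle step might stream each partition to the GPU that owns it over an already-connected GPUnet socket. gsend() is the call used in the echo client above; the partition layout, kernel signature and one-partition-per-threadblock assignment are assumptions for illustration, not GimMR's actual code.

/* Hypothetical shuffle step: each threadblock streams its partitions to the
   destination GPU over a pre-established GPUnet socket. All threads in a
   block call gsend() cooperatively with the same arguments. */
__global__ void shuffle(int *peer_socks, uchar **parts, int *part_len,
                        int nr_parts)
{
        for (int p = blockIdx.x; p < nr_parts; p += gridDim.x) {
                int sent = 0;
                while (sent < part_len[p]) {
                        int ret = gsend(peer_socks[p], parts[p] + sent,
                                        part_len[p] - sent);
                        if (ret < 0)
                                break;   /* send error: give up on this partition */
                        sent += ret;
                }
        }
}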
We also built a GPU-native face verification server, which can be used as part of a security system for access authorization, for example at border control.
The input is a photo of a person and her name. The system fetches the stored photos of that person from a database and compares them with the input. We use the Local Binary Pattern (LBP) algorithm for image comparison.
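For reference, the kernel below sketches the basic 3x3 LBP operator: each interior pixel is replaced by an 8-bit code formed by thresholding its eight neighbors against the center, and verification then compares histograms of these codes. This is the textbook formulation written for illustration, not the server's actual kernel.

/* Basic 3x3 LBP operator: one thread per interior pixel produces an 8-bit
   code by comparing the eight neighbors with the center pixel.
   Illustrative only; the face verification server's kernel differs. */
__global__ void lbp_kernel(const unsigned char *img, unsigned char *codes,
                           int width, int height)
{
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
                return;

        unsigned char center = img[y * width + x];
        /* Neighbor offsets, clockwise starting from the top-left pixel. */
        const int dx[8] = { -1,  0,  1, 1, 1, 0, -1, -1 };
        const int dy[8] = { -1, -1, -1, 0, 1, 1,  1,  0 };

        unsigned char code = 0;
        for (int i = 0; i < 8; i++) {
                unsigned char n = img[(y + dy[i]) * width + (x + dx[i])];
                code |= (unsigned char)((n >= center) << i);
        }
        codes[y * width + x] = code;
}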
In the GPUnet-based server, the GPU maintains persistent connections with an unmodified memcached server running on another machine, as well as with two other machines running multiple (unmodified) CPU clients. On the CPU side we use rsocket, a drop-in user-level socket replacement library for streaming over InfiniBand RDMA.
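A CPU-side client using rsocket looks almost identical to one using regular sockets, only with the r-prefixed calls from <rdma/rsocket.h>. The minimal sketch below connects, sends a request and reads a reply; the address handling is ordinary socket boilerplate, while the request/reply framing is omitted and the function name is made up.

/* Minimal rsocket client sketch: connect, send a request, read a reply.
   Error handling and message framing are intentionally minimal. */
#include <rdma/rsocket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

int query_server(const char *ip, int port, char *buf, size_t len)
{
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        inet_pton(AF_INET, ip, &addr.sin_addr);

        int fd = rsocket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;
        if (rconnect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                rclose(fd);
                return -1;
        }
        rsend(fd, buf, len, 0);              /* request, e.g. name + photo */
        ssize_t n = rrecv(fd, buf, len, 0);  /* reply, e.g. match result */
        rclose(fd);
        return (int)n;
}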

GPU-native server design options

We considered several GPU-native design options and compared three server designs, starting from the standard GPU-accelerated design.
Performance
Our microbenchmarks indicate that the raw bandwidth achievable between two GPUs, or between a GPU and a CPU, is within a few percent of CPU-to-CPU communication. The communication latency, however, is quite high. The main slowdown factor is CPU-GPU coordination, which will be eliminated once a GPU gains direct access to the NIC's doorbell registers.
GPU-native face verification performance