GPUnet is a native GPU networking layer that provides a reliable stream abstraction over InfiniBand and high-level socket APIs to programs running on NVIDIA GPUs.
GPUnet enables threads or threadblocks in one GPU to communicate with threads in other GPUs or CPUs via a standard, familiar socket interface, regardless of whether they reside in the same machine or in different machines.
GPUnet uses peer-to-peer DMA (via GPUDirect RDMA) to place and manage a GPU application's network buffers directly in GPU memory.
__global__ void gpuclient(struct sockaddr_in *addr, int *tb_alloc_tbl, int nr_tb) {
    __shared__ int sock;                // socket shared by all threads in the threadblock
    __shared__ uchar buf[BUF_SIZE];     // network buffer in shared memory
    int ret, i;

    // Retry until the in-GPU connect to the server at *addr succeeds.
    while ((sock = gconnect_in(addr)) < 0) {};
    assert(sock >= 0);

    for (i = 0; i < NR_MSG; i++) {
        int recved = 0, sent = 0;

        // Send the whole buffer; gsend() may transmit only part of it per call.
        do {
            ret = gsend(sock, buf + sent, BUF_SIZE - sent);
            if (ret < 0) {
                goto out;
            } else {
                sent += ret;
            }
        } while (sent < BUF_SIZE);
        __syncthreads();

        // Receive the response; grecv() may likewise return partial data.
        do {
            ret = grecv(sock, buf + recved, BUF_SIZE - recved);
            if (ret < 0) {
                goto out;
            } else {
                recved += ret;
            }
        } while (recved < BUF_SIZE);
        __syncthreads();
    }

out:
    // Only one thread per threadblock closes the shared socket.
    BEGIN_SINGLE_THREAD_PART {
        single_thread_gclose(sock);
    } END_SINGLE_THREAD_PART;
}
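For completeness, the sketch below shows one way the host side might launch this kernel. The launch geometry, address setup, and table initialization are illustrative assumptions; real GPUnet host code also has to initialize the GPUnet runtime and its CPU helper, which is omitted here because those calls are library-specific.

    // Hypothetical host-side launcher for the gpuclient kernel above (assumed to be
    // in the same file). GPUnet runtime/CPU-helper initialization is omitted; this
    // only illustrates argument setup and the kernel launch itself.
    #include <cuda_runtime.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>

    #define NR_TB 4             // number of threadblocks, each driving its own socket (example)
    #define THREADS_PER_TB 256  // threads cooperating on each gsend/grecv (example)

    int main(void) {
        struct sockaddr_in addr, *d_addr;
        int *d_tb_alloc_tbl;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(8888);                        // example port
        inet_pton(AF_INET, "192.168.0.1", &addr.sin_addr);  // example server address

        cudaMalloc((void **)&d_addr, sizeof(addr));
        cudaMemcpy(d_addr, &addr, sizeof(addr), cudaMemcpyHostToDevice);
        cudaMalloc((void **)&d_tb_alloc_tbl, NR_TB * sizeof(int));
        cudaMemset(d_tb_alloc_tbl, 0, NR_TB * sizeof(int));

        gpuclient<<<NR_TB, THREADS_PER_TB>>>(d_addr, d_tb_alloc_tbl, NR_TB);
        cudaDeviceSynchronize();
        return 0;
    }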
GPUnet allows GPU programs to communicate directly from the GPU, cutting CPU code development out of the loop. This is the key to its programming simplicity. The top layer of the GPUnet infrastructure provides a unified networking API with a reliable streaming abstraction. GPUnet implements blocking and non-blocking versions of commonly used API functions, such as send and recv.
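As a sketch of how the non-blocking variants could be used inside a kernel: the function name gsend_nb and its return convention (zero when the call would block) are assumptions made for this illustration, not the documented GPUnet API.

    typedef unsigned char uchar;

    // Hypothetical prototypes: gsend_nb is assumed to behave like gsend but to return 0
    // instead of blocking when the stream has no room; do_other_work stands in for any
    // useful computation the threadblock could overlap with networking.
    __device__ int gsend_nb(int sock, uchar *buf, int len);
    __device__ void do_other_work(void);

    // Push an entire buffer through the (assumed) non-blocking send, overlapping other
    // work whenever the call would block instead of stalling the threadblock.
    __device__ int send_all_nonblocking(int sock, uchar *buf, int len) {
        int sent = 0;
        while (sent < len) {
            int ret = gsend_nb(sock, buf + sent, len - sent);
            if (ret < 0)
                return ret;          // error: propagate to the caller
            if (ret == 0)
                do_other_work();     // stream full: do something useful and retry
            else
                sent += ret;         // partial send: advance and continue
        }
        return sent;
    }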
The bottom layer exposes a reliable channel abstraction to the upper layers. A channel enables sending fixed-size messages reliably between two endpoints. To achieve high performance, GPUnet utilizes the advanced capabilities of InfiniBand HCAs, which implement RDMA functionality in hardware. Moreover, if the chipset and the GPU support peer-to-peer DMA over PCIe, GPUnet uses it to store network buffers directly in GPU memory, bypassing the CPU. As the figure below shows, CPU programs or other GPUs can use the same HCA at the same time.
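To make the buffer placement concrete, the sketch below shows the generic GPUDirect RDMA recipe: allocate the buffer with cudaMalloc and register the device pointer with the HCA through ibv_reg_mr, which requires the NVIDIA peer-memory kernel module. This is a generic illustration of the mechanism, not GPUnet's internal code; the buffer size and helper name are arbitrary.

    #include <infiniband/verbs.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define GPU_NET_BUF_SIZE (1 << 20)  // 1 MB of network buffers in GPU memory (example size)

    // Allocate a buffer in GPU memory and register it with the HCA so that the NIC
    // can DMA packets directly to and from it, bypassing host memory.
    struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, void **gpu_buf) {
        if (cudaMalloc(gpu_buf, GPU_NET_BUF_SIZE) != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed\n");
            return NULL;
        }
        struct ibv_mr *mr = ibv_reg_mr(pd, *gpu_buf, GPU_NET_BUF_SIZE,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE |
                                       IBV_ACCESS_REMOTE_READ);
        if (!mr)
            fprintf(stderr, "ibv_reg_mr on GPU memory failed (is the peer-memory module loaded?)\n");
        return mr;
    }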
However, if no RDMA is available, as is the case for UNIX domain sockets or TCP, GPUnet relies on CPU assistance for transport-layer processing and for moving data in and out of GPU memory. The top-layer interface, however, does not change. Note that this path does not guarantee high performance, since it depends on the capabilities of the underlying hardware.
The intermediate layer of GPUnet implements a reliable stream abstraction on top of the reliable channels, which in turn requires GPUnet to implement stream flow control.
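One common way to provide such flow control is credit-based: the receiver advertises how many fixed-size message slots it has free, and the sender stops when it runs out of credits. The sketch below illustrates only this idea; channel_send, poll_credit_update, and the counters are hypothetical names, not GPUnet's actual stream implementation.

    #define MAX_CREDIT 64   // receive buffer holds this many fixed-size message slots (example)

    // Hypothetical channel primitives: send one fixed-size message, and poll for
    // credits the receiver returns as it frees slots in its receive buffer.
    __device__ int channel_send(int chan, const unsigned char *msg);
    __device__ int poll_credit_update(int chan);

    struct stream_tx {
        int credits;   // messages we may still send without overflowing the receiver;
                       // initialized to MAX_CREDIT when the stream is set up
    };

    // Send one fixed-size message on the stream, waiting until the receiver has room.
    __device__ int stream_send_msg(int chan, struct stream_tx *tx, const unsigned char *msg) {
        while (tx->credits == 0)
            tx->credits += poll_credit_update(chan);   // wait for the receiver to return credits
        tx->credits--;
        return channel_send(chan, msg);
    }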
GPUnet uses a CPU helper for HCA management and connection establishment. This allows GPUnet to share the network namespace with CPU programs on the same machine, e.g., the ports used by CPU programs cannot be used by GPUnet and vice versa.
However, once the connection is established, no CPU involvement should be necessary. Unfortunately, while data transfers in the current GPUnet implementation require no CPU involvement, control is still performed by a CPU helper. That is because on current hardware a GPU is still unable to fully control a NIC: the NVIDIA GPUs and NVIDIA CUDA that we use to implement GPUnet do not allow mapping hardware MMIO regions into the GPU memory space, which is necessary to access the NIC's doorbell registers and trigger RDMA transfers.
We therefore employ a ring buffer for NIC-to-GPU and GPU-to-NIC communication, as depicted above. The ring buffer implements a producer-consumer pattern over PCIe, allowing the GPU to post send requests and letting the CPU handle them asynchronously.
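A minimal sketch of such a ring is shown below: it lives in host-mapped (zero-copy) memory, e.g. allocated with cudaHostAlloc and cudaHostAllocMapped, so both the GPU producer and the CPU helper can see it over PCIe. The request layout, names, and single-producer assumption are ours for illustration, not GPUnet's actual data structures.

    #include <cuda_runtime.h>

    #define RING_SLOTS 256   // number of outstanding send requests (example)

    struct send_req {                 // one send request descriptor
        unsigned long gpu_buf_off;    // offset of the payload within the registered GPU buffer
        int len;                      // payload length in bytes
    };

    // The ring is assumed to live in host-mapped, zero-copy memory (cudaHostAlloc with
    // cudaHostAllocMapped) so both sides share it across PCIe. Single producer, single
    // consumer; head and tail are monotonically increasing counters.
    struct ring {
        struct send_req slots[RING_SLOTS];
        volatile unsigned int head;   // written by the GPU producer
        volatile unsigned int tail;   // written by the CPU consumer
    };

    // GPU side: post one send request, spinning while the ring is full.
    __device__ void ring_post(struct ring *r, struct send_req req) {
        unsigned int h = r->head;
        while (h - r->tail == RING_SLOTS)
            ;                                 // full: wait for the CPU helper to drain
        r->slots[h % RING_SLOTS] = req;
        __threadfence_system();               // make the slot visible across PCIe before publishing
        r->head = h + 1;
    }

    // CPU side: consume pending requests and hand them to the RDMA send path.
    void ring_drain(struct ring *r) {
        while (r->tail != r->head) {
            struct send_req req = r->slots[r->tail % RING_SLOTS];
            // ... post an RDMA send/write for req here ...
            (void)req;
            r->tail = r->tail + 1;
        }
    }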