GPUnet allows GPU programs to communicate over the network directly from GPU code, removing the need to develop CPU-side code. This is the key to programming simplicity.
GPUrdma allows GPU programs to issue RDMA requests directly from the GPU without traversing the CPU, cutting the round-trip latency down to 5 µs.
Centaur is a GPU-centric architecture for building a low-latency multi-GPU network server. We implement a multi-GPU distributed dataflow runtime that enables efficient and scalable processing of network requests on GPUs. Our experiments show that our server achieves near-perfect scaling for a k-NN service on 16 GPUs, beating the throughput of a highly optimized CPU-driven server by 35% while maintaining an average request latency of about 2 ms. Furthermore, it requires only a single CPU core to run, achieving over an order of magnitude higher throughput than the standard CPU-driven server architecture.