Accelerator-Centric Operating System

Specialized programmable hardware accelerators, e.g., GPUs, smart network adapters, smart storage drives, and FPGAs, have become essential components of modern computing systems. However, this heterogeneity of processing elements challenges a software architecture dominated by traditional CPU-centric programming models and abstractions.

We are working to bridge the gap between modern hardware and software by introducing a new accelerator-centric operating system architecture. The goal is to turn accelerators into first-class programmable devices on a par with the CPU, thereby hiding hardware heterogeneity behind convenient OS abstractions. Several projects in the lab work toward realizing this vision.

An overview video and a more recent talk on this work are available.

Figure: The accelerator-centric operating system architecture, OmniX, enables direct interaction between accelerators and I/O devices, for example, file and network-socket access for GPU kernels.
Inline processing on FPGA SmartNICs | We develop novel OS abstractions and a framework for accelerating server applications on FPGA SmartNICs, as well as a novel FPGA-based driver that directly controls an ASIC NIC, making the NIC's hardware offloads accessible to the FPGA logic.
SmartNIC-driven accelerator-centric architecture | We show that today's SmartNICs provide enough horsepower to drive up to a hundred low-latency ML accelerators while completely offloading network-stack processing from the host CPU. When running an inference server with 12 GPUs at 300 µs per inference request, host CPU utilization is zero.
GPU Storage and File System Access | This project enables native file access from GPU kernels. We build a GPU file system layer (GPUfs), GPU-centric memory-mapped file management (ActivePointers), GPU-page-fault-based memory-mapped files integrated with the Linux page cache (GAIA), and full integration of peer-to-peer GPU-to-SSD transfers with the Linux page cache and file I/O API (SPIN).
GPUNICs | We develop convenient programming abstractions that let GPUs perform highly efficient, low-latency network I/O without CPU involvement. This GPU networking layer is used to build low-latency multi-GPU network servers that bypass the host CPU.
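To make the SPIN idea concrete, here is a toy, CPU-only Python model of its read-routing decision (our simplification for illustration; the function and callback names are hypothetical, not the real implementation): pages already resident in the host page cache are served from it, while uncached pages are routed directly from the SSD to GPU memory via peer-to-peer DMA, avoiding a redundant copy through host memory.

```python
PAGE = 4096

def spin_read(offset, length, page_cache, p2p_dma, cached_copy):
    """Split a read into per-page requests and route each one:
    page-cache hit -> copy from host memory, miss -> direct SSD->GPU DMA."""
    first = offset // PAGE
    last = (offset + length - 1) // PAGE
    plan = []
    for page in range(first, last + 1):
        if page in page_cache:
            plan.append(("cache", page))
            cached_copy(page)          # copy from host page cache to GPU
        else:
            plan.append(("p2p", page))
            p2p_dma(page)              # direct SSD -> GPU transfer
    return plan

# Example: pages 0 and 2 are cached; page 1 goes peer-to-peer.
log = []
plan = spin_read(0, 3 * PAGE, page_cache={0, 2},
                 p2p_dma=lambda p: log.append(("p2p", p)),
                 cached_copy=lambda p: log.append(("cache", p)))
```

The point of the policy is that neither path alone is optimal: always going peer-to-peer ignores data already in host memory, while always going through the page cache wastes a host-memory copy on cold data.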

Confidential Computing

SpecFuzz | The SpecFuzz instrumentation compiler and fuzzer reveals code fragments that are likely to be exploitable in speculative-execution attacks (Spectre V1).
Varys | Varys protects SGX enclaves against practical side-channel attacks by detecting SGX exits and proactively cleaning the cache and CPU state, while guaranteeing that trusted SGX threads are co-scheduled on a CPU, without the need to disable hyperthreading.
Power to Peep-All | Inference attacks by malicious batteries on mobile devices
Secure Virtual Memory | Overriding memory operations inside the enclave enables an additional level of custom, secure, application-managed virtual memory. As a result, any in-enclave application can be made more secure, or achieve higher performance, simply by recompiling with the appropriate memory backing store.
Foreshadow | Speculative execution attack on SGX enclaves
Exitless system calls for SGX enclaves | We analyze the overheads of SGX enclave transitions between trusted and untrusted execution modes, and propose a fast remote procedure call (RPC) mechanism that eliminates these overheads completely.
Autarky | Autarky protects SGX against the page fault side-channel attack with practical architecture modifications and a self-paging mechanism that collaborates with the OS to support secure demand-paging.
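A minimal sketch of the exitless idea, as a toy, CPU-only Python model (all names are illustrative; a real implementation uses untrusted shared memory, spin-waiting, and SGX primitives rather than Python threads and events): instead of exiting the enclave to issue a system call, the trusted thread posts a request into a shared slot and waits for the result, while an untrusted worker thread executes the call on its behalf and posts the reply.

```python
import threading

class RpcSlot:
    """Shared 'untrusted memory' slot holding one request/response pair."""
    def __init__(self):
        self.request = None
        self.result = None
        self.ready = threading.Event()   # request is ready to be served
        self.done = threading.Event()    # result has been posted

def enclave_call(slot, name, *args):
    """Trusted side: post a request and wait -- no enclave exit."""
    slot.request = (name, args)
    slot.done.clear()
    slot.ready.set()
    slot.done.wait()                     # a real enclave would spin-wait
    return slot.result

def untrusted_worker(slot, stop):
    """Untrusted side: execute 'syscalls' on behalf of the enclave."""
    table = {"add": lambda a, b: a + b}  # stand-in for real system calls
    while not stop.is_set():
        if slot.ready.wait(timeout=0.1):
            slot.ready.clear()
            name, args = slot.request
            slot.result = table[name](*args)
            slot.done.set()

slot, stop = RpcSlot(), threading.Event()
worker = threading.Thread(target=untrusted_worker, args=(slot, stop))
worker.start()
r = enclave_call(slot, "add", 2, 3)
stop.set()
worker.join()
```

The performance win comes from the waiting strategy: the trusted thread never pays the cost of an enclave exit and re-entry, only the latency of the shared-memory handshake.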

Programmable Networks

Using Neural Nets for Scaling Multidimensional Range Matching | Our method offers a new point in the design space of multidimensional range-matching algorithms, replacing pointer chasing (the primary method used today) with neural-network inference. Trading memory accesses for computation is very promising, given the poor scalability of on-chip memories and the growing computational capabilities of modern and emerging CPUs. A talk on this topic is available.
Distributed Applications in Programmable Switches | Network switches can run sophisticated data-plane programs, but writing them is extremely hard. This project aims to simplify building distributed algorithms in switches.
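A toy illustration of the range-matching idea (our own construction for intuition, not the paper's actual learned model): a multidimensional range rule lo_i <= x_i <= hi_i can be encoded exactly as a tiny two-layer network with step activations, one hidden unit per bound, so matching a point becomes a fixed-cost inference pass instead of a pointer chase through a tree.

```python
def step(z):
    """Step activation: fires iff its input is non-negative."""
    return 1 if z >= 0 else 0

def make_range_matcher(lows, highs):
    """Compile the rule lo_i <= x_i <= hi_i (for all i) into a two-layer
    step-activation network: 2d hidden units (one per bound) and an
    output unit that fires only when every bound is satisfied."""
    d = len(lows)
    def matcher(x):
        hidden = []
        for i in range(d):
            hidden.append(step(x[i] - lows[i]))    # x_i >= lo_i
            hidden.append(step(highs[i] - x[i]))   # x_i <= hi_i
        # Output neuron: all weights 1, bias -2d; fires iff sum == 2d.
        return step(sum(hidden) - 2 * d)
    return matcher

# Example 2-D rule: first field in [10, 20], second field in [5, 15].
match = make_range_matcher(lows=[10, 5], highs=[20, 15])
```

For a single rule this is just arithmetic, but it shows the shape of the trade: the tree traversal's data-dependent memory accesses are replaced by dense computation that maps well onto SIMD and ML engines.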

GPU computing, Networking, Machine Learning, Distributed Systems

Accelerating distributed DNN training | This project explores ways to reduce the communication and computation bottlenecks in DNN training by using smart sampling.
Software-managed caching in GPUs | GPUs have small hardware caches, so smart techniques that leverage scratchpad memory and registers for caching can be highly beneficial.
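As a toy, CPU-side illustration of the software-managed caching idea (plain Python standing in for GPU code, with all names ours): a small direct-mapped cache of "scratchpad" slots sits in front of slow "global memory", and hit/miss counters make the benefit of data reuse visible.

```python
class SoftwareCache:
    """Direct-mapped software cache: a few small slots stand in for
    scratchpad/registers; `global_mem` stands in for slow DRAM."""
    def __init__(self, global_mem, lines=4):
        self.mem = global_mem
        self.lines = lines
        self.tags = [None] * lines     # which address each slot holds
        self.data = [None] * lines     # the cached value for that address
        self.hits = self.misses = 0

    def load(self, addr):
        slot = addr % self.lines       # direct-mapped placement
        if self.tags[slot] == addr:
            self.hits += 1             # served from fast storage
        else:
            self.misses += 1           # fetch from slow global memory
            self.tags[slot] = addr
            self.data[slot] = self.mem[addr]
        return self.data[slot]

mem = list(range(100, 200))            # "global memory" contents
cache = SoftwareCache(mem, lines=4)
vals = [cache.load(a) for a in [0, 1, 0, 1, 4, 0]]
```

On a real GPU the same decisions (which addresses to keep, where to place them, when to evict) are made explicitly by the kernel, since the hardware caches are too small to do it well on their own.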