Accelerator-Centric Operating System

Specialized programmable hardware accelerators, such as GPUs, smart network adapters, smart storage drives, and FPGAs, have become essential components of modern computing systems. However, this heterogeneity of processing elements challenges software architectures dominated by traditional CPU-centric programming models and abstractions.

We are working on bridging the gap between modern hardware and software by introducing a new accelerator-centric operating system architecture. The goal is to turn accelerators into first-class programmable devices on a par with the CPU, thereby hiding hardware heterogeneity behind convenient OS abstractions. Many projects in the lab realize this vision.

See the video with the overview here, and a more recent talk here.

The accelerator-centric operating system architecture, OmniX, enables direct interaction between accelerators and I/O devices, for example, by making files and network sockets available to GPU kernels.
SmartNIC Accelerators | We show that today's SmartNICs provide enough horsepower to drive up to hundreds of low-latency ML accelerators while completely offloading network stack processing from the host CPU. When running an inference server with 12 GPUs at 300 µs per inference request, the CPU utilization is 0.
NICA: Inline Near-Data Processing on SmartNICs | NICA is a hardware and software framework for accelerating server applications via inline data processing. It offers new OS abstractions for inline processing and runs on FPGA-based SmartNICs. It integrates inline processing with standard network stacks and enables SmartNIC virtualization.
GPU Storage and File System Access | This project shows how to enable native file access from GPU kernels. We build a GPU file system layer (GPUfs), GPU-centric memory-mapped file management (ActivePointers), GPU page-fault-based memory-mapped files integrated with the Linux page cache (GAIA), and full integration of peer-to-peer GPU-to-SSD transfers with the Linux page cache and file I/O API (SPIN).
GPUNICs | We develop convenient programming abstractions that let GPUs perform highly efficient, low-latency network I/O without involving the CPU. We use this GPU networking layer to design low-latency multi-GPU network servers that do not use the host CPU.

OS Services for Trusted Execution Environments

Our projects enable overriding memory operations in the enclave, which makes it possible to build another level of custom, secure, application-managed virtual memory. As a result, any in-enclave application can easily be made more secure or faster simply by recompiling it with the appropriate memory backing store.

Secure Virtual Memory | Overriding memory operations in the enclave allows building another level of custom, secure, application-managed virtual memory. As a result, any in-enclave application can easily be made more secure or faster simply by recompiling it with the appropriate memory backing store.
Exitless system calls for SGX enclaves | We analyze the overheads of SGX enclave transitions between the trusted and untrusted modes of execution. We propose a fast remote procedure call (RPC) mechanism to eliminate these overheads completely.
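The general idea behind an exitless call can be sketched without any SGX code: instead of leaving the enclave, the caller posts a request into shared memory and spins until an untrusted worker thread has serviced it. The sketch below is only an illustration under that assumption, using plain pthreads and C11 atomics. The names `rpc_slot`, `rpc_worker`, `rpc_call`, `rpc_shutdown`, and `add_one` are hypothetical and are not taken from the project, and the sketch supports a single caller only.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical request slot shared by a caller (standing in for enclave
 * code) and an untrusted worker thread. Illustrative names only. */
enum slot_state { SLOT_IDLE, SLOT_PENDING, SLOT_DONE, SLOT_SHUTDOWN };

struct rpc_slot {
    _Atomic int state;   /* one of enum slot_state */
    long (*fn)(long);    /* function the worker runs on the caller's behalf */
    long arg;
    long ret;
};

/* Stand-in for a system call that the worker would issue. */
static long add_one(long x)
{
    return x + 1;
}

/* Untrusted side: poll the slot and service pending requests. */
static void *rpc_worker(void *p)
{
    struct rpc_slot *s = p;
    for (;;) {
        int st = atomic_load_explicit(&s->state, memory_order_acquire);
        if (st == SLOT_SHUTDOWN)
            return NULL;
        if (st == SLOT_PENDING) {
            s->ret = s->fn(s->arg);
            /* Release so the caller sees ret before it sees SLOT_DONE. */
            atomic_store_explicit(&s->state, SLOT_DONE, memory_order_release);
        }
    }
}

/* Caller side: publish the request, then spin until it completes.
 * No mode transition is involved; the cost is a busy-polling worker. */
static long rpc_call(struct rpc_slot *s, long (*fn)(long), long arg)
{
    s->fn = fn;
    s->arg = arg;
    atomic_store_explicit(&s->state, SLOT_PENDING, memory_order_release);
    while (atomic_load_explicit(&s->state, memory_order_acquire) != SLOT_DONE)
        ;  /* spin */
    long ret = s->ret;
    atomic_store_explicit(&s->state, SLOT_IDLE, memory_order_relaxed);
    return ret;
}

/* Ask the worker to exit; call only when no request is in flight. */
static void rpc_shutdown(struct rpc_slot *s)
{
    atomic_store_explicit(&s->state, SLOT_SHUTDOWN, memory_order_release);
}
```

A caller would start `rpc_worker` with `pthread_create`, issue requests with `rpc_call(&slot, add_one, 41)`, and finish with `rpc_shutdown` followed by `pthread_join`. The design trades a dedicated polling core for the removal of per-call transitions.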

Hardware Side Channels

SpecFuzz | The SpecFuzz instrumentation compiler and fuzzer reveal code fragments that are likely to be exploitable in speculative execution attacks (Spectre V1).
Varys | Varys protects SGX against practical side-channel attacks by identifying SGX exits and proactively cleaning the cache and CPU state, while guaranteeing co-scheduled execution of trusted SGX threads on a CPU without the need to disable hyperthreading.
Power to Peep-All | Inference attacks by malicious batteries on mobile devices
Foreshadow | Speculative execution attack on SGX enclaves
Autarky | Autarky protects SGX against the page fault side-channel attack with practical architecture modifications and a self-paging mechanism that collaborates with the OS to support secure demand-paging.
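For readers unfamiliar with Spectre V1, the fragment below shows the textbook bounds-check-bypass shape that tools in this space, such as SpecFuzz, look for. It is an illustration only: it is not SpecFuzz output, and the names `array1`, `array2`, and `victim_lookup` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY1_SIZE 16u

static uint8_t array1[ARRAY1_SIZE];
/* Probe array: one 64-byte stride per possible byte value. */
static uint8_t array2[256 * 64];

/* Architecturally this function is safe: an out-of-bounds idx returns 0.
 * The concern is microarchitectural. If the branch is mispredicted, the
 * CPU may speculatively read array1[idx] out of bounds and use that byte
 * to index array2, leaving a cache footprint that depends on the byte. */
uint8_t victim_lookup(size_t idx)
{
    if (idx < ARRAY1_SIZE) {
        return array2[array1[idx] * 64];
    }
    return 0;
}
```

The pattern to notice is a bounds check followed by two dependent loads, where the second load's address is derived from the value read by the first.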

GPU computing, Networking, Machine Learning, Distributed Systems

A Computational Approach to Packet Classification | Can we use machine-learning techniques to convert the packet-classification task from memory- to compute-bound?
Using DNNs in computer systems | What if we could perform DNN inference in a few CPU cycles? Can DNNs be used to efficiently store large data sets and transform memory-bound applications into compute-bound ones?
Accelerating distributed DNN training | This project looks into ways to reduce the communication and computation bottlenecks in DNN training by using smart sampling.
Distributed Applications in Programmable Switches | Network switches can run fairly sophisticated data-plane programs, but writing them is extremely hard. This project strives to simplify building distributed algorithms in switches.
Software-managed caching in GPUs | GPUs have small caches, so explicitly managing data in scratchpad memory and registers in software can be highly beneficial.
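To make the memory-bound versus compute-bound question from the packet-classification project concrete, here is a minimal sketch of the general idea of replacing pointer chasing with arithmetic. It fits a simple linear model that predicts where a key falls in a sorted array of range start values, then confirms the answer with a short bounded search. This is an assumption-laden illustration, not the project's method: the linear model is a stand-in for whatever learned model a real system would use, and all names (`learned_index`, `learned_index_fit`, `learned_index_lookup`, `floor_search`) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical learned lookup over one packet field (e.g. a port number). */
struct learned_index {
    const uint32_t *starts;  /* ascending range start values */
    size_t n;
    double slope, intercept; /* position is approximated by slope*key + intercept */
    size_t max_err;          /* worst prediction error observed on starts[] */
};

/* Index of the last element <= key in a[lo..hi), or -1 if there is none. */
static long floor_search(const uint32_t *a, size_t lo, size_t hi, uint32_t key)
{
    long found = -1;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] <= key) { found = (long)mid; lo = mid + 1; }
        else               { hi = mid; }
    }
    return found;
}

/* Predicted position, clamped to [0, n-1]. Requires n >= 1. */
static size_t predict(const struct learned_index *ix, uint32_t key)
{
    double p = ix->slope * (double)key + ix->intercept;
    if (p < 0.0) return 0;
    if (p > (double)(ix->n - 1)) return ix->n - 1;
    return (size_t)p;
}

/* Least-squares fit of position against key, plus the observed error bound. */
static void learned_index_fit(struct learned_index *ix,
                              const uint32_t *starts, size_t n)
{
    ix->starts = starts; ix->n = n;
    ix->slope = 0.0; ix->intercept = 0.0; ix->max_err = 0;
    if (n == 0) return;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        double x = (double)starts[i], y = (double)i;
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double denom = (double)n * sxx - sx * sx;
    if (denom != 0.0) ix->slope = ((double)n * sxy - sx * sy) / denom;
    ix->intercept = (sy - ix->slope * sx) / (double)n;
    for (size_t i = 0; i < n; i++) {
        size_t p = predict(ix, starts[i]);
        size_t err = p > i ? p - i : i - p;
        if (err > ix->max_err) ix->max_err = err;
    }
}

/* Index of the range containing key, or -1 if key is below every start. */
static long learned_index_lookup(const struct learned_index *ix, uint32_t key)
{
    if (ix->n == 0) return -1;
    size_t p = predict(ix, key);
    size_t slack = ix->max_err + 1;
    size_t lo = p > slack ? p - slack : 0;
    size_t hi = p + slack + 1 < ix->n ? p + slack + 1 : ix->n;
    long i = floor_search(ix->starts, lo, hi, key);
    /* Verify the windowed answer; fall back to a full search if the
     * model's window missed, so the result is always correct. */
    int ok = (i >= 0)
        ? ((size_t)i + 1 == ix->n || ix->starts[i + 1] > key)
        : (lo == 0);
    if (!ok) i = floor_search(ix->starts, 0, ix->n, key);
    return i;
}
```

When the model fits the data well, most lookups touch only a few adjacent array entries after a multiply and an add. The fallback search keeps results correct even when the fit is poor.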