Publications

Accelerator-Centric Operating System

[SIGCOMM]   A computational approach to packet classification
BibTeX
@misc{rashelbach2020computational,
title={A Computational Approach to Packet Classification},
author={Alon Rashelbach and Ori Rottenstreich and Mark Silberstein},
year={2020},
eprint={2002.07584},
archivePrefix={arXiv},
primaryClass={cs.DC}
}

Abstract

Multi-field packet classification is a crucial component in modern software-defined data center networks. To achieve high throughput and low latency, state-of-the-art algorithms strive to fit the rule lookup data structures into on-die caches; however, they do not scale well with the number of rules. We present a novel approach, NuevoMatch, which improves the memory scaling of existing methods. A new data structure, Range Query Recursive Model Index (RQ-RMI), is the key component that enables NuevoMatch to replace most of the accesses to main memory with model inference computations. We describe an efficient training algorithm which guarantees the correctness of the RQ-RMI-based classification. The use of RQ-RMI allows the packet rules to be compressed into model weights that fit into the hardware cache and takes advantage of the growing support for fast neural network processing in modern CPUs, such as wide vector processing engines, achieving a rate of tens of nanoseconds per lookup. Our evaluation using 500K multi-field rules from the standard ClassBench benchmark shows a geomean compression factor of 4.9X, 8X, and 82X, and average performance improvement of 2.7X, 4.4X and 2.6X in latency and 1.3X, 2.2X, and 1.2X in throughput compared to CutSplit, NeuroCuts, and TupleMerge, all state-of-the-art algorithms.
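The core idea in the abstract, replacing memory accesses with model inference followed by a bounded local search, can be illustrated with a toy learned index. This is a minimal sketch under stated assumptions, not the RQ-RMI structure or its training algorithm: it uses a single linear model over one field, assumes non-overlapping ranges, and the class name `LearnedRangeIndex` is hypothetical.

```python
import bisect


class LearnedRangeIndex:
    """Toy learned index over non-overlapping [low, high] integer ranges.

    A single linear model predicts where a value falls in the sorted list of
    range starts; the worst-case prediction error measured at build time
    bounds the window that the final search has to inspect.
    """

    def __init__(self, ranges):
        self.ranges = sorted(ranges)
        self.starts = [low for low, _ in self.ranges]
        n = len(self.starts)
        span = self.starts[-1] - self.starts[0]
        # Line through the first and last range start (assumption: one model).
        self.slope = (n - 1) / span if span else 0.0
        self.intercept = -self.slope * self.starts[0]
        # Worst-case error of the model on the keys it was built from.
        self.max_err = max(
            abs(self._predict(key) - pos) for pos, key in enumerate(self.starts)
        )

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, value):
        """Return the index of the range containing value, or None."""
        n = len(self.starts)
        if value < self.starts[0]:
            return None
        p = self._predict(value)
        # The model is monotone, so a value lying between two range starts is
        # at most max_err + 1 positions away from its prediction.
        lo = min(max(p - self.max_err - 1, 0), n - 1)
        hi = min(max(p + self.max_err + 1, 0), n)  # exclusive upper bound
        i = bisect.bisect_right(self.starts, value, lo, hi) - 1
        low, high = self.ranges[i]
        return i if low <= value <= high else None


if __name__ == "__main__":
    index = LearnedRangeIndex([(0, 9), (10, 19), (40, 49), (1000, 1003)])
    for probe in (45, 25, 1001):
        print(probe, index.lookup(probe))
```

The extra `+ 1` in the search window is what keeps lookups correct for values that fall between range starts; the paper's training algorithm provides a comparable guarantee for its own model, which this sketch does not reproduce.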

Alon Rashelbach, Ori Rottenstreich, Mark Silberstein to appear
[ASPLOS]   Lynx: a SmartNIC-driven accelerator-centric architecture for network servers
BibTeX
@inproceedings{lynx20Tork,
author = {Tork, Maroun and Maudlej, Lina and Silberstein, Mark},
title = {Lynx: A SmartNIC-Driven Accelerator-Centric Architecture for Network Servers},
year = {2020},
isbn = {9781450371025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3373376.3378528},
doi = {10.1145/3373376.3378528},
booktitle = {Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {117--131},
numpages = {15},
keywords = {server architecture, hardware accelerators, smartnics, i/o services for accelerators, operating systems},
location = {Lausanne, Switzerland},
series = {ASPLOS ’20}
}




Abstract

This paper explores new opportunities afforded by the growing deployment of compute and I/O accelerators to improve the performance and efficiency of hardware-accelerated computing services in data centers.

We propose Lynx, an accelerator-centric network server architecture that offloads the server data and control planes to the SmartNIC, and enables direct networking from accelerators via a lightweight hardware-friendly I/O mechanism. Lynx enables the design of hardware-accelerated network servers that run without CPU involvement, freeing CPU cores and improving performance isolation for accelerated services. It is portable across accelerator architectures and allows the management of both local and remote accelerators, seamlessly scaling beyond a single physical machine.

We implement and evaluate Lynx on GPUs and the Intel Visual Compute Accelerator, as well as two SmartNIC architectures – one with an FPGA, and another with an 8-core ARM processor. Compared to a traditional host-centric approach, Lynx achieves over 4× higher throughput for a GPU-centric face verification server, where it is used for GPU communications with an external database, and 25% higher throughput for a GPU-accelerated neural network inference service. For this workload, we show that a single SmartNIC may drive 4 local and 8 remote GPUs while achieving linear performance scaling without using the host CPU.

[ACM TOCS]   SPIN: Seamless OS integration of Peer-to-Peer DMA between SSDs and GPUs
BibTeX
@article{spin19TOCS,
author = {Bergman, Shai and Brokhman, Tanya and Cohen, Tzachi and Silberstein, Mark},
title = {SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs},
year = {2019},
issue_date = {April 2019},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {36},
number = {2},
issn = {0734-2071},
url = {https://doi.org/10.1145/3309987},
doi = {10.1145/3309987},
journal = {ACM Trans. Comput. Syst.},
month = apr,
articleno = {Article 5},
numpages = {26},
keywords = {I/O subsystem, Accelerators, operating systems, file systems, GPU}
}




Abstract

Recent GPUs enable Peer-to-Peer Direct Memory Access (p2p) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using p2p to access files is challenging because of the subtleties of low-level non-standard interfaces, which bypass the OS file I/O layers and may hurt system performance. Developers must possess intimate knowledge of low-level interfaces to manually handle the subtleties of data consistency and misaligned accesses.

We present SPIN, which integrates p2p into the standard OS file I/O stack, dynamically activating p2p where appropriate, transparently to the user. It combines p2p with page cache accesses, re-enables read-ahead for sequential reads, all while maintaining standard POSIX FS consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID.

We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces, and end-to-end experiments. SPIN achieves significant performance speedups across a wide range of workloads, exceeding p2p throughput by up to an order of magnitude. It also boosts the performance of an aerial imagery rendering application by 2.6× by dynamically adapting to its input-dependent file access pattern, enables 3.3× higher throughput for a GPU-accelerated log server, and enables 29% faster execution for the highly optimized GPU-accelerated image collage with only 30 changed lines of code.

Shai Bergman, Tanya Brokhman, Tsahi Cohen, Mark Silberstein
Extended version of the ATC'17 paper
[PACT]   Achieving Scalability in a k-NN Multi-GPU Network Service with Centaur
BibTeX
@INPROCEEDINGS{8891620,
author={A. {Watad} and A. {Libov} and O. {Shacham} and E. {Bortnikov} and M. {Silberstein}},
booktitle={2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT)},
title={Achieving Scalability in a k-NN Multi-GPU Network Service with Centaur},
year={2019},
volume={},
number={},
pages={245-257},
keywords={concurrency (computers);file servers;graphics processing units;nearest neighbour methods;parallel processing;storage management;multiGPU distributed data flow runtime;GPU management overheads;CPU load;k-NN multiGPU network service;Centaur;GPU-centric architecture;network request processing;CPU-driven server architecture;k-nearest-neighbors network server;scalability;parallel computing;high-concurrency memory-demanding server applications;Graphics processing units;Servers;Kernel;Throughput;Clustering algorithms;Approximation algorithms;Computer architecture;GPU;Parallel Computing},
doi={10.1109/PACT.2019.00027},
ISSN={1089-795X},
month={Sep.},}

Abstract

Centaur is a GPU-centric architecture for building a low-latency approximate k-Nearest-Neighbors network server. We implement a multi-GPU distributed data flow runtime which enables efficient and scalable network request processing on GPUs. The runtime eliminates GPU management overheads from the CPU, making the server throughput and response time largely agnostic to the CPU load, speed or the number of dedicated CPU cores. Our experiments show that our server achieves near-perfect scaling for 16 GPUs, beating the throughput of a highly-optimized CPU-driven server by 35% while maintaining about 2 msec average request latency. Furthermore, it requires only a single CPU core to run, achieving over an order of magnitude higher throughput than the standard CPU-driven server architecture in this setting.

[USENIX ATC]   GAIA: An OS Page Cache for Heterogeneous Systems
BibTeX
@inproceedings{10.5555/3358807.3358864,
author = {Brokhman, Tanya and Lifshits, Pavel and Silberstein, Mark},
title = {GAIA: An OS Page Cache for Heterogeneous Systems},
year = {2019},
isbn = {9781939133038},
publisher = {USENIX Association},
address = {USA},
booktitle = {Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference},
pages = {661--674},
numpages = {14},
location = {Renton, WA, USA},
series = {USENIX ATC ’19}
}

Abstract

We propose a principled approach to integrating GPU memory with an OS page cache. We design GAIA, a weakly-consistent page cache that spans CPU and GPU memories. GAIA enables the standard mmap system call to map files into the GPU address space, thereby enabling data-dependent GPU accesses to large files and efficient write-sharing between the CPU and GPUs. Under the hood, GAIA (1) integrates a lazy release consistency protocol into the OS page cache while maintaining backward compatibility with CPU processes and unmodified GPU kernels; (2) improves CPU I/O performance by using data cached in GPU memory; and (3) optimizes the readahead prefetcher to support accesses to files cached in GPUs.

We prototype GAIA in Linux and evaluate it on NVIDIA Pascal GPUs. We show up to 3× speedup in CPU file I/O and up to 8× in unmodified realistic workloads such as Gunrock GPU-accelerated graph processing, image collage, and microscopy image stitching.
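For readers unfamiliar with the access pattern the abstract refers to, the following CPU-only sketch shows data-dependent reads through a memory-mapped file using Python's standard `mmap` module. It does not use GAIA or any GPU; the file layout (a chain of 64-bit "next record" indices) and the helper names `write_chain` and `follow_chain` are assumptions made purely for illustration.

```python
import mmap
import struct

# One little-endian unsigned 64-bit "next record index" per record (assumed layout).
RECORD = struct.Struct("<Q")


def write_chain(path, next_indices):
    """Write a file in which record i stores the index of the record to visit next."""
    with open(path, "wb") as f:
        for nxt in next_indices:
            f.write(RECORD.pack(nxt))


def follow_chain(path, start, hops):
    """Follow the chain for a fixed number of hops through a memory-mapped view.

    Which bytes are touched depends on the data read so far, so only the pages
    that are actually dereferenced need to be brought into memory.
    """
    visited = [start]
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            idx = start
            for _ in range(hops):
                (idx,) = RECORD.unpack_from(view, idx * RECORD.size)
                visited.append(idx)
    return visited
```

With a plain `read` interface the program would have to issue an explicit I/O call per hop or load the whole file up front; mapping the file lets the page cache serve whatever the traversal happens to touch.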

[USENIX ATC]   NICA: An Infrastructure for Inline Acceleration of Network Applications
BibTeX
@inproceedings {234884,
author = {Haggai Eran and Lior Zeno and Maroun Tork and Gabi Malka and Mark Silberstein},
title = {{NICA}: An Infrastructure for Inline Acceleration of Network Applications},
booktitle = {2019 {USENIX} Annual Technical Conference ({USENIX} {ATC} 19)},
year = {2019},
isbn = {978-1-939133-03-8},
address = {Renton, WA},
pages = {345--362},
url = {https://www.usenix.org/conference/atc19/presentation/eran},
publisher = {{USENIX} Association},
month = jul,
}

Abstract

With rising network rates, cloud vendors increasingly deploy FPGA-based SmartNICs (F-NICs), leveraging their inline processing capabilities to offload hypervisor networking infrastructure. However, the use of F-NICs for accelerating general-purpose server applications in clouds has been limited.

NICA is a hardware-software co-designed framework for inline acceleration of the application data plane on F-NICs in multi-tenant systems. A new ikernel programming abstraction, tightly integrated with the network stack, enables application control of F-NIC computations that process application network traffic, with minimal code changes. In addition, NICA’s virtualization architecture supports fine-grain time-sharing of F-NIC logic and provides I/O path virtualization. Together these features enable cost-effective sharing of F-NICs across virtual machines with strict performance guarantees.

We prototype NICA on Mellanox F-NICs and integrate ikernels with the high-performance VMA network stack and the KVM hypervisor. We demonstrate significant acceleration of real-world applications in both bare-metal and virtualized environments, while requiring only minor code modifications to accelerate them on F-NICs. For example, a transparent key-value store cache ikernel added to the stock memcached server reaches 40 Gbps server throughput (99% line-rate) at 6 μs 99th-percentile latency for 16-byte key-value pairs, which is 21× the throughput of a 6-core CPU with a kernel-bypass network stack. The throughput scales linearly for up to 6 VMs running independent instances of memcached.

[FCCM]   Design Patterns for Code Reuse in HLS Packet Processing Pipelines
BibTeX
@INPROCEEDINGS{8735559,
author={H. {Eran} and L. {Zeno} and Z. {István} and M. {Silberstein}},
booktitle={2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)},
title={Design Patterns for Code Reuse in HLS Packet Processing Pipelines},
year={2019},
volume={},
number={},
pages={208-217},
keywords={field programmable gate arrays;high level synthesis;logic design;software libraries;class library;FPGA-based SmartNICs;code reuse;HLS packet processing pipelines;high-level synthesis;high-speed networking applications;UDP stateless firewall;key-value store cache;FPGA circuits;Optimization;Tools;C++ languages;Logic gates;Hardware;Field programmable gate arrays;Data structures;High level synthesis;Design methodology;Networking;Packet processing},
doi={10.1109/FCCM.2019.00036},
ISSN={2576-2613},
month={April},}

Abstract

High-level synthesis (HLS) allows developers to be more productive in designing FPGA circuits thanks to familiar programming languages and high-level abstractions. In order to create high-performance circuits, HLS tools, such as Xilinx Vivado HLS, require following specific design patterns and techniques. Unfortunately, when applied to network packet processing tasks, these techniques limit code reuse and modularity, requiring developers to use deprecated programming conventions. We propose a methodology for developing high-speed networking applications using Vivado HLS for C++, focusing on reusability, code simplicity, and overall performance. Following this methodology, we implement a class library (ntl) with several building blocks that can be used in a wide spectrum of networking applications. We evaluate the methodology by implementing two applications: a UDP stateless firewall and a key-value store cache designed for FPGA-based SmartNICs, both processing packets at 40Gbps line-rate.

[SFMA'19]   One Interface to Rule them All: A Hardware/Software Co-Design for Disaggregated Computing
BibTeX
@misc{caladan-position,
author = {Lluis Vilanova and Yoav Etsion and Mark Silberstein},
title = {{One Interface to Rule them All: A Hardware/Software
Co-Design for Disaggregated Computing}},
series = {SFMA'19},
}

Abstract

Datacenters are moving towards a paradigm of pooling resources (e.g., CPUs, storage and accelerators) into separate nodes to lower costs through easier hardware upgradability and higher resource utilization when running applications with heterogeneous demands.
A single request to an application can trigger a chain of accesses to multiple devices, but each device has wildly different hardware capabilities which expose vastly different data and control interfaces. As a result, applications cannot securely span all these devices in a way that keeps the cost and simplicity benefits of disaggregation while maintaining efficiency.

 

In this paper, we propose extending NICs to implement a model of continuation-based computations inspired by dataflow, which is used to weave the execution flow of applications across hardware devices without the need for each device to know each other’s communication protocol.

To achieve this, we lean on the observation that modern technology trends like device self-virtualization, multi-queue designs, RDMA and remote device transports (e.g., NVMe over fabric [14]) can be extended to allow devices to interact with each other without the need for intermediate software layers. Existing NICs can be easily extended to trigger such continuations as a response to device command completions, translating a continuation into a request directed at the next device on the processing pipeline.
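The continuation model described above can be pictured with a small software simulation. This is only a sketch under stated assumptions: the `Continuation` type, the `dispatch` loop standing in for the NIC, and the three toy "devices" are all hypothetical and say nothing about the actual hardware interface proposed in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Continuation:
    """One pipeline step plus what to trigger once that step completes."""
    device: str                     # name of the device that handles this step
    then: Optional["Continuation"]  # continuation fired on completion, if any


def dispatch(
    devices: Dict[str, Callable[[object], object]],
    cont: Optional[Continuation],
    payload: object,
) -> Tuple[object, List[str]]:
    """Stand-in for the NIC: turn each completion into the next device request.

    Devices only see their own payload; the chain itself is carried by the
    continuation, so no device needs to know which device comes next.
    """
    trace = []
    while cont is not None:
        payload = devices[cont.device](payload)  # device completes its command
        trace.append(cont.device)
        cont = cont.then                         # completion triggers next step
    return payload, trace


if __name__ == "__main__":
    store = {"k": b"hello"}
    devices = {
        "ssd": lambda key: store[key],           # storage lookup
        "accel": lambda data: data.upper(),      # accelerator transform
        "nic": lambda data: b"reply:" + data,    # build the network reply
    }
    chain = Continuation("ssd", Continuation("accel", Continuation("nic", None)))
    print(dispatch(devices, chain, "k"))
```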

[USENIX ATC]   SPIN: Seamless OS integration of Peer-to-Peer DMA between SSDs and GPUs
BibTeX
@inproceedings {203153,
author = {Shai Bergman and Tanya Brokhman and Tzachi Cohen and Mark Silberstein},
title = {{SPIN}: Seamless Operating System Integration of Peer-to-Peer {DMA} Between SSDs and GPUs},
booktitle = {2017 {USENIX} Annual Technical Conference ({USENIX} {ATC} 17)},
year = {2017},
isbn = {978-1-931971-38-6},
address = {Santa Clara, CA},
pages = {167--179},
url = {https://www.usenix.org/conference/atc17/technical-sessions/presentation/bergman},
publisher = {{USENIX} Association},
month = jul,
}

Abstract

Recent GPUs enable Peer-to-Peer Direct Memory Access (P2P) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using P2P to access files is challenging because of the subtleties of low-level nonstandard interfaces, which bypass the OS file I/O layers and may hurt system performance.

SPIN integrates P2P into the standard OS file I/O stack, dynamically activating P2P where appropriate, transparently to the user. It combines P2P with page cache accesses, re-enables read-ahead for sequential reads, all while maintaining standard POSIX FS consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID.

We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces and end-to-end experiments. SPIN achieves significant performance speedups across a wide range of workloads, exceeding P2P throughput by up to an order of magnitude. It also boosts the performance of an aerial imagery rendering application by 2.6× by dynamically adapting to its input-dependent file access pattern, and enables 3.3× higher throughput for a GPU-accelerated log server.

Shai Bergman, Tanya Brokhman, Tsahi Cohen, Mark Silberstein
[HotOS]   OmniX: an accelerator-centric OS for omni-programmable systems
BibTeX
@inproceedings{OmniX,
author = {Silberstein, Mark},
title = {OmniX: An Accelerator-Centric OS for Omni-Programmable Systems},
year = {2017},
isbn = {9781450350686},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3102980.3102992},
doi = {10.1145/3102980.3102992},
booktitle = {Proceedings of the 16th Workshop on Hot Topics in Operating Systems},
pages = {69--75},
numpages = {7},
location = {Whistler, BC, Canada},
series = {HotOS ’17}
}




Abstract

Future systems will be omni-programmable: alongside CPUs, GPUs and FPGAs, they will execute user code near-storage, near-network, near-memory, or on other Near-X accelerator Units (NXUs). This paper explores the design space of OS support for omni-programmable systems, aiming to simplify the development of efficient applications that span multiple heterogeneous processors and near-data accelerators. OmniX is an accelerator-centric OS architecture that extends standard OS abstractions, such as task execution and I/O, into NXUs while maintaining a coherent view of the system among all the processors. OmniX enables NXUs to directly invoke tasks and access I/O services among themselves, excluding the CPU from the performance-critical control plane operations. The host CPU serves as a controller for protection, device configuration and monitoring. We discuss the hardware trends that motivate our work, outline OmniX design principles, and sketch the core implementation ideas while highlighting missing hardware features, in the hope of motivating hardware vendors to implement them soon.

[EuroSys]   Eleos: Exit-Less OS Services for SGX Enclaves
BibTeX
@inproceedings{Eleos,
author = {Orenbach, Meni and Lifshits, Pavel and Minkin, Marina and Silberstein, Mark},
title = {Eleos: ExitLess OS Services for SGX Enclaves},
year = {2017},
isbn = {9781450349383},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3064176.3064219},
doi = {10.1145/3064176.3064219},
booktitle = {Proceedings of the Twelfth European Conference on Computer Systems},
pages = {238--253},
numpages = {16},
location = {Belgrade, Serbia},
series = {EuroSys ’17}
}




Abstract

Intel Software Guard eXtensions (SGX) enable secure and trusted execution of user code in an isolated enclave to protect against a powerful adversary. Unfortunately, running I/O-intensive, memory-demanding server applications in enclaves leads to significant performance degradation. Such applications put a substantial load on the in-enclave system call and secure paging mechanisms, which turn out to be the main reason for the application slowdown. In addition to the high direct cost of thousands-of-cycles long SGX management instructions, these mechanisms incur the high indirect cost of enclave exits due to associated TLB flushes and processor state pollution.

We tackle these performance issues in Eleos by enabling exit-less system calls and exit-less paging in enclaves. Eleos introduces a novel Secure User-managed Virtual Memory (SUVM) abstraction that implements application-level paging inside the enclave. SUVM eliminates the overheads of enclave exits due to paging, and enables new optimizations such as sub-page granularity of accesses. We thoroughly evaluate Eleos on a range of microbenchmarks and two real server applications, achieving notable system performance gains. With Eleos, memcached and a face verification server running in-enclave achieve up to 2.2× and 2.3× higher throughput, respectively, while working on datasets up to 5× larger than the enclave’s secure physical memory.
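Application-level paging of the kind the abstract attributes to SUVM can be pictured as a small software-managed page cache in front of a larger backing store. The sketch below is an illustration only: it runs as ordinary Python, omits the encryption and integrity protection a real enclave would need for evicted pages, uses LRU eviction as an assumed policy, and the class name `SoftwarePagedMemory` is hypothetical.

```python
from collections import OrderedDict


class SoftwarePagedMemory:
    """Byte-addressable memory whose resident set is managed by the application."""

    def __init__(self, page_size, max_resident_pages):
        self.page_size = page_size
        self.max_resident = max_resident_pages
        self.resident = OrderedDict()  # page number -> bytearray, oldest first
        self.backing = {}              # page number -> bytes (stand-in for untrusted memory)
        self.evictions = 0

    def _page(self, page_no):
        """Return the resident page, paging it in (and evicting) if necessary."""
        if page_no in self.resident:
            self.resident.move_to_end(page_no)  # mark as most recently used
            return self.resident[page_no]
        if len(self.resident) >= self.max_resident:
            victim, data = self.resident.popitem(last=False)  # least recently used
            self.backing[victim] = bytes(data)
            self.evictions += 1
        # Pages never written before start out zero-filled.
        page = bytearray(self.backing.pop(page_no, bytes(self.page_size)))
        self.resident[page_no] = page
        return page

    def write(self, addr, data):
        for i, byte in enumerate(data):
            page_no, offset = divmod(addr + i, self.page_size)
            self._page(page_no)[offset] = byte

    def read(self, addr, length):
        out = bytearray()
        for i in range(length):
            page_no, offset = divmod(addr + i, self.page_size)
            out.append(self._page(page_no)[offset])
        return bytes(out)
```

Because page-ins and evictions here are ordinary function calls made by the application, none of them requires leaving the protected context, which is the property the abstract describes as exit-less paging.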

BibTeX
@inproceedings{Eleos,
author = {Orenbach, Meni and Lifshits, Pavel and Minkin, Marina and Silberstein, Mark},
title = {Eleos: ExitLess OS Services for SGX Enclaves},
year = {2017},
isbn = {9781450349383},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3064176.3064219},
doi = {10.1145/3064176.3064219},
booktitle = {Proceedings of the Twelfth European Conference on Computer Systems},
pages = {238–253},
numpages = {16},
location = {Belgrade, Serbia},
series = {EuroSys ’17}
}



Meni Orenbach, Marina Minkin, Tsahi Cohen, Mark Silberstein
[ACM TOCS]   GPUnet: networking abstractions for GPUs
BibTeX for GPUnet: networking abstractions for GPUs

GPUnet: networking abstractions for GPUs

BibTeX
@article{silberstein16GPUnet,
author = {Silberstein, Mark and Kim, Sangman and Huh, Seonggu and Zhang, Xinya and Hu, Yige and Wated, Amir and Witchel, Emmett},
title = {GPUnet: Networking Abstractions for GPU Programs},
year = {2016},
issue_date = {September 2016},
publisher = {ACM},
address = {New York, NY, USA},
volume = {34},
number = {3},
issn = {0734-2071},
doi = {10.1145/2963098},
journal = {ACM Transactions on Computer Systems},
month = sep,
articleno = {Article 9},
numpages = {31},
}
abstract for GPUnet: networking abstractions for GPUs

GPUnet: networking abstractions for GPUs

Abstract

Despite the popularity of GPUs in high-performance and scientific computing, and despite increasingly general-purpose hardware capabilities, the use of GPUs in network servers or distributed systems poses significant challenges.

GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.

Mark Silberstein, Sangman Kim, Amir Watad, Yige Hu, Xinya Zhang, Seonggu Huh, Emmett Witchel
Extended version of the OSDI'14 paper, Fast-track acceptance
[ROSS]   GPUrdma: GPU-side library for high performance networking from GPU kernels
BibTeX for GPUrdma: GPU-side library for high performance networking from GPU kernels

GPUrdma: GPU-side library for high performance networking from GPU kernels

BibTeX
@inproceedings{ross16gpurdma,
author = {Daoud, Feras and Watad, Amir and Silberstein, Mark},
title = {GPUrdma: GPU-Side Library for High Performance Networking from GPU Kernels},
year = {2016},
isbn = {9781450343879},
publisher = {ACM},
url = {https://doi.org/10.1145/2931088.2931091},
doi = {10.1145/2931088.2931091},
booktitle = {Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers},
articleno = {Article 6},
numpages = {8},
keywords = {Networking, accelerators, Operating Systems Design, GPGPUs},
location = {Kyoto, Japan},
series = {ROSS ’16}
}

abstract for GPUrdma: GPU-side library for high performance networking from GPU kernels

GPUrdma: GPU-side library for high performance networking from GPU kernels

Abstract

We present GPUrdma, a GPU-side library for performing Remote Direct Memory Accesses (RDMA) across the network directly from GPU kernels. The library executes no code on the CPU, directly accessing the Host Channel Adapter (HCA) InfiniBand hardware for both control and data. Slow single-thread GPU performance and the intricacies of the GPU-to-network adapter interaction pose a significant challenge. We describe several design options and analyze their performance implications in detail.

We achieve 5 µsec one-way communication latency and up to 50 Gbit/sec transfer bandwidth for messages of 16 KB and larger between K40c NVIDIA GPUs across the network. Moreover, GPUrdma outperforms CPU RDMA for smaller packets ranging from 2 to 1024 bytes by a factor of 4.5x, thanks to the greater parallelism of transfer requests enabled by highly parallel GPU hardware.

We use GPUrdma to implement a subset of the global address space programming interface (GPI) for point-to-point asynchronous RDMA messaging. We demonstrate our preliminary results using two simple applications, ping-pong and a multi-matrix-vector product with a constant matrix and multiple vectors, each running on two different machines connected by InfiniBand. Our basic ping-pong implementation achieves 5% higher performance than the baseline using GPI-2. The improved ping-pong implementation with per-threadblock communication overlap enables a further 20% improvement. The multi-matrix-vector product is up to 4.5x faster thanks to higher throughput for small messages and the ability to keep the matrix in fast GPU shared memory while receiving new inputs.

The GPUrdma prototype is not yet suitable for production systems due to hardware constraints in the current generation of NVIDIA GPUs, which we discuss in detail. However, our results highlight the great potential of GPU-side native networking, and encourage further research toward a scalable, high-performance, heterogeneous networking infrastructure.

Feras Daoud, Amir Watad, Mark Silberstein
Best Paper Award
[SYSTOR]   Supporting data-driven I/O on GPUs using GPUfs
BibTeX for Supporting data-driven I/O on GPUs using GPUfs

Supporting data-driven I/O on GPUs using GPUfs

BibTeX
@inproceedings{gpufs16systor,
author = {Shahar, Sagi and Silberstein, Mark},
title = {Supporting Data-Driven I/O on GPUs Using GPUfs},
year = {2016},
isbn = {9781450343817},
publisher = {ACM},
url = {https://doi.org/10.1145/2928275.2928276},
doi = {10.1145/2928275.2928276},
booktitle = {Proceedings of the 9th ACM International on Systems and Storage Conference},
articleno = {Article 12},
numpages = {11},
keywords = {GPGPUs, Operating Systems, File Systems},
location = {Haifa, Israel},
series = {SYSTOR ’16}
}



abstract for Supporting data-driven I/O on GPUs using GPUfs

Supporting data-driven I/O on GPUs using GPUfs

Abstract

Using discrete GPUs for processing very large datasets is challenging, in particular when an algorithm exhibits unpredictable, data-driven access patterns. In this paper, we investigate the utility of GPUfs, a library that provides direct access to files from GPU programs, to implement such algorithms. We analyze the system’s bottlenecks, and suggest several modifications to the GPUfs design, including a new concurrent hash table for the buffer cache and a highly parallel memory allocator. We also show that by implementing the workload in a warp-centric manner we can improve the performance even further. We evaluate our changes by implementing a real image processing application which creates collages from a dataset of 10 million images. The enhanced GPUfs design improves the application performance by 5.6× on average over the original GPUfs, and outperforms both a 12-core parallel CPU implementation which uses the AVX instruction set and a standard CUDA-based GPU implementation by up to 2.5× and 3×, respectively, while significantly enhancing system programmability and simplifying the application design and implementation.

Sagi Shahar, Mark Silberstein
[ISCA]   ActivePointers: A Case for Software Address Translation on GPUs
BibTeX for ActivePointers: A Case for Software Address Translation on GPUs

ActivePointers: A Case for Software Address Translation on GPUs

BibTeX
@INPROCEEDINGS{activepointers16isca,
author={Shahar, Sagi and Bergman, Shai and Silberstein, Mark},
booktitle={2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)},
title={{ActivePointers: A Case for Software Address Translation on GPUs}},
year={2016},
pages={596--608},
series = {ISCA'16}
}
abstract for ActivePointers: A Case for Software Address Translation on GPUs

ActivePointers: A Case for Software Address Translation on GPUs

Abstract

Modern discrete GPUs have been the processors of choice for accelerating compute-intensive applications, but using them in large-scale data processing is extremely challenging. Unfortunately, they do not provide important I/O abstractions long established in the CPU context, such as memory mapped files, which shield programmers from the complexity of buffer and I/O device management. However, implementing these abstractions on GPUs poses a problem: the limited GPU virtual memory system provides no address space management and page fault handling mechanisms to GPU developers, and does not allow modifications to memory mappings for running GPU programs.

We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs, and enables the implementation of fully functional memory mapped files on commodity GPUs. Files mapped into GPU memory are accessed using active pointers, which behave like regular pointers but access the GPU page cache under the hood, and trigger page faults which are handled on the GPU. We design and evaluate a number of novel mechanisms, including a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling of threads in a single warp.

We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the entire file into GPU memory and accesses it via active pointers. The use of active pointers adds only up to 1% to the application’s runtime, while enabling speedups of up to 3.9× over a combined CPU+GPU implementation and 2.6× over a 12-core CPU-only implementation which uses AVX vector instructions.

Sagi Shahar, Shai Bergman, Mark Silberstein
[GPGPU]   GPUpIO: The Case for I/O-Driven Preemption on GPUs
BibTeX for GPUpIO: The Case for I/O-Driven Preemption on GPUs

GPUpIO: The Case for I/O-Driven Preemption on GPUs

BibTeX
@inproceedings{GPUPIO16GPGPU,
author = {Zeno, Lior and Mendelson, Avi and Silberstein, Mark},
title = {GPUpIO: The Case for I/O-Driven Preemption on GPUs},
year = {2016},
isbn = {9781450341950},
publisher = {ACM},
url = {https://doi.org/10.1145/2884045.2884053},
doi = {10.1145/2884045.2884053},
booktitle = {Proceedings of the 9th Annual Workshop on General Purpose Processing Using Graphics Processing Unit},
pages = {63--71},
numpages = {9},
keywords = {accelerators, GPGPUs, operating systems design, file systems, source-to-source compilation},
location = {Barcelona, Spain},
series = {GPGPU ’16}
}
abstract for GPUpIO: The Case for I/O-Driven Preemption on GPUs

GPUpIO: The Case for I/O-Driven Preemption on GPUs

Abstract

As GPUs become general purpose, they are outgrowing the coprocessor model and require convenient I/O abstractions such as files and network sockets. Recent studies have shown the benefits of native GPU I/O layers, in terms of both programmability and performance. However, due to lack of hardware support, the GPU threads performing I/O calls are forced to busy-wait for the completion of I/O operations, resulting in underutilized hardware, higher power consumption, and reduced system throughput.

We argue that I/O-driven preemption improves the performance of existing solutions, despite many challenging system characteristics such as a large kernel state. We analyze the benefits of adding preemption support using a simple system performance model, and, encouraged by the results, explore the design of a software-based preemption mechanism for GPUs. In our prototype, GPUpIO, we implement a source-to-source compiler for state checkpoint and restoration, and a runtime library for scheduling preempted thread-blocks, which together enable I/O-driven preemption for GPUs.

We evaluate our prototype across a variety of system parameters and workloads to determine when preemption is worthwhile. We show that in some workloads the I/O-driven preemption approach may indeed double the effective system throughput by completely hiding the I/O latency behind computations. However, we also observe that the software-only solution is currently limited, not only due to its overheads, but also because it does not have sufficient control of the hardware scheduler queue and therefore may lead to starvation of I/O kernels. We then discuss a new hardware feature that, if added, may render a general I/O-driven preemption mechanism on GPUs practical.

Lior Zeno, Avi Mendelson, Mark Silberstein
[OSDI]   GPUnet: Networking Abstractions for GPU Programs
BibTeX for GPUnet: Networking Abstractions for GPU Programs

GPUnet: Networking Abstractions for GPU Programs

BibTeX
@inproceedings {gpunet14osdi,
author = {Sangman Kim and Seonggu Huh and Xinya Zhang and Yige Hu and Amir Wated and Emmett Witchel and Mark Silberstein},
title = {GPUnet: Networking Abstractions for {GPU} Programs},
booktitle = {11th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 14)},
year = {2014},
isbn = { 978-1-931971-16-4},
address = {Broomfield, CO},
pages = {201--216},
url = {https://www.usenix.org/conference/osdi14/technical-sessions/presentation/kim},
publisher = {{USENIX} Association},
month = oct,
}
abstract for GPUnet: Networking Abstractions for GPU Programs

GPUnet: Networking Abstractions for GPU Programs

Abstract

Despite the popularity of GPUs in high-performance and scientific computing, and despite increasingly general-purpose hardware capabilities, the use of GPUs in network servers or distributed systems poses significant challenges.

GPUnet is a native GPU networking layer that provides a socket abstraction and high-level networking APIs for GPU programs. We use GPUnet to streamline the development of high-performance, distributed applications like in-GPU-memory MapReduce and a new class of low-latency, high-throughput GPU-native network services such as a face verification server.

Sangman Kim, Seonggu Huh, Yige Hu, Xinya Zhang, Emmett Witchel, Amir Watad, Mark Silberstein
[CACM]   GPUfs: the case for operating system services on GPUs
BibTeX for GPUfs: the case for operating system services on GPUs

GPUfs: the case for operating system services on GPUs

BibTeX
@article{gpufs14cacm,
author = {Silberstein, Mark and Ford, Bryan and Witchel, Emmett},
title = {GPUfs: The Case for Operating System Services on GPUs},
year = {2014},
issue_date = {November 2014},
publisher = {ACM},
address = {New York, NY, USA},
volume = {57},
number = {12},
issn = {0001-0782},
url = {https://doi.org/10.1145/2656206},
doi = {10.1145/2656206},
journal = {Commun. ACM},
month = nov,
pages = {68--79},
numpages = {12}
}



abstract for GPUfs: the case for operating system services on GPUs

GPUfs: the case for operating system services on GPUs

Abstract

This is a non-technical article that covers the main aspects of the GPUfs file system layer for GPU software that makes operating system abstractions available to GPU code.

Mark Silberstein, Bryan Ford, Emmett Witchel
Invited to Communications of the ACM
[ACM TOCS]   GPUfs: Integrating a file system with GPUs
BibTeX for GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs

BibTeX
@article{gpufs14tocs,
author = {Silberstein, Mark and Ford, Bryan and Keidar, Idit and Witchel, Emmett},
title = {GPUfs: Integrating a File System with GPUs},
year = {2014},
issue_date = {February 2014},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {32},
number = {1},
issn = {0734-2071},
url = {https://doi.org/10.1145/2553081},
doi = {10.1145/2553081},
journal = {ACM Trans. Comput. Syst.},
month = feb,
articleno = {Article 1},
numpages = {31},
keywords = {operating systems, GPGPUs, operating systems design, file systems, Accelerators}
}



abstract for GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs

Abstract

As GPU hardware becomes increasingly general-purpose, it is quickly outgrowing the traditional, constrained GPU-as-coprocessor programming model. This article advocates for extending standard operating system services and abstractions to GPUs in order to facilitate program development and enable harmonious integration of GPUs in computing systems. As an example, we describe the design and implementation of GPUfs, a software layer which provides operating system support for accessing host files directly from GPU programs. GPUfs provides a POSIX-like API, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the host CPU’s buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adapted to use our file system, demonstrate the feasibility and benefits of the GPUfs approach. For example, a self-contained GPU program that searches for a set of strings throughout the Linux kernel source tree runs over seven times faster than on an eight-core CPU.

Mark Silberstein, Bryan Ford, Idit Keidar, Emmett Witchel
Extended version of the ASPLOS'13 paper, Fast-track acceptance
[ACM UBIQUITY]   GPUs: High-performance Accelerators for Parallel Applications.
BibTeX for GPUs: High-performance Accelerators for Parallel Applications.

GPUs: High-performance Accelerators for Parallel Applications.

BibTeX
@article{uniquity,
author = {Silberstein, Mark},
title = {GPUs: High-Performance Accelerators for Parallel Applications: The Multicore Transformation (Ubiquity Symposium)},
year = {2014},
issue_date = {August 2014},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {2014},
number = {August},
url = {https://doi.org/10.1145/2618401},
doi = {10.1145/2618401},
journal = {Ubiquity},
month = aug,
articleno = {Article 1},
numpages = {13}
}



abstract for GPUs: High-performance Accelerators for Parallel Applications.

GPUs: High-performance Accelerators for Parallel Applications.

Abstract

Early graphical processing units (GPUs) were designed as high compute density, fixed-function processors ideally crafted to the needs of computer graphics workloads. Today, GPUs are becoming truly first-class computing elements on par with CPUs. Programming GPUs as self-sufficient general-purpose processors is not only hypothetically desirable, but feasible and efficient in practice, opening new opportunities for integration of GPUs in complex software systems.

Mark Silberstein
Invited to Ubiquity Symposium on Parallel Computing
[ASPLOS]   GPUfs: integrating a file system with GPUs
BibTeX for GPUfs: integrating a file system with GPUs

GPUfs: integrating a file system with GPUs

BibTeX
@inproceedings{gpufs13ASPLOS,
author = {Silberstein, Mark and Ford, Bryan and Keidar, Idit and Witchel, Emmett},
title = {GPUfs: Integrating a File System with GPUs},
year = {2013},
isbn = {9781450318709},
publisher = {ACM},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/2451116.2451169},
doi = {10.1145/2451116.2451169},
booktitle = {Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems},
pages = {485--498},
numpages = {14},
keywords = {accelerators, operating systems design, file systems, gpgpus},
location = {Houston, Texas, USA},
series = {ASPLOS ’13}
}



abstract for GPUfs: integrating a file system with GPUs

GPUfs: integrating a file system with GPUs

Abstract

GPU hardware is becoming increasingly general purpose, quickly outgrowing the traditional but constrained GPU-as-coprocessor programming model. To make GPUs easier to program and easier to integrate with existing systems, we propose making the host’s file system directly accessible from GPU code.

GPUfs provides a POSIX-like API for GPU programs, exploits GPU parallelism for efficiency, and optimizes GPU file access by extending the buffer cache into GPU memory. Our experiments, based on a set of real benchmarks adapted to use our file system, demonstrate the feasibility and benefits of our approach. For example, we demonstrate a simple self-contained GPU program which searches for a set of strings in the entire tree of Linux kernel source files over seven times faster than an eight-core CPU run.


OS Services for Trusted Execution Environments

[EuroSys]   Autarky: Closing controlled channels with self-paging enclaves
BibTeX for Autarky: Closing controlled channels with self-paging enclaves

Autarky: Closing controlled channels with self-paging enclaves

BibTeX
@inproceedings{autarky20eurosys,
author = {Orenbach, Meni and Baumann, Andrew and Silberstein, Mark},
title = {{Autarky: Closing controlled channels with self-paging enclaves}},
booktitle = {Fifteenth European Conference on Computer Systems, Heraklion, Greece},
series = {EuroSys ’20},
year = {2020}
}
abstract for Autarky: Closing controlled channels with self-paging enclaves

Autarky: Closing controlled channels with self-paging enclaves

Abstract

As the first widely-deployed secure enclave hardware, Intel SGX shows promise as a practical basis for confidential cloud computing. However, side channels remain SGX’s greatest security weakness. In particular, the “controlled-channel attack” on enclave page faults exploits a longstanding architectural side channel and still lacks effective mitigation.

We propose Autarky: a set of minor, backward-compatible modifications to the SGX ISA that hide an enclave’s page access trace from the host, and give the enclave full control over its page faults. A trusted library OS implements enclave self-paging policy.

We prototype Autarky on current SGX hardware and the Graphene library OS, implementing three paging schemes: a fast software oblivious RAM system made practical by leveraging the proposed ISA, a novel page cluster abstraction for application-aware secure self-paging, and a rate-limiting paging mechanism for unmodified binaries. Overall, Autarky provides a comprehensive defense against controlled-channel attacks that supports efficient secure demand paging and adds no overhead in page-fault-free execution.

[USENIX ATC]   CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves
BibTeX for CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves

CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves

BibTeX
@inproceedings {234958,
author = {Meni Orenbach and Yan Michalevsky and Christof Fetzer and Mark Silberstein},
title = {CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves},
booktitle = {2019 {USENIX} Annual Technical Conference ({USENIX} {ATC} 19)},
year = {2019},
isbn = {978-1-939133-03-8},
address = {Renton, WA},
pages = {555--570},
url = {https://www.usenix.org/conference/atc19/presentation/orenbach},
publisher = {{USENIX} Association},
month = jul,
}
abstract for CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves

CoSMIX: A Compiler-based System for Secure Memory Instrumentation and Execution in Enclaves

Abstract

Hardware secure enclaves are increasingly used to run complex applications. Unfortunately, existing and emerging enclave architectures do not allow secure and efficient implementation of custom page fault handlers. This limitation impedes in-enclave use of secure memory-mapped files and prevents extensions of the application memory layer commonly used in untrusted systems, such as transparent memory compression or access to remote memory.

CoSMIX is a Compiler-based system for Secure Memory Instrumentation and eXecution of applications in secure enclaves. A novel memory store abstraction allows implementation of application-level secure page fault handlers that are invoked by a lightweight enclave runtime. The CoSMIX compiler instruments the application memory accesses to use one or more memory stores, guided by a global instrumentation policy or code annotations without changing application code.

The CoSMIX prototype runs on Intel SGX and is compatible with popular SGX execution environments, including SCONE and Graphene. Our evaluation of several production applications shows how CoSMIX improves their security and performance by recompiling them with appropriate memory stores. For example, unmodified Redis and Memcached key-value stores achieve about 2× speedup by using a self-paging memory store while working on datasets up to 6× larger than the enclave’s secure memory. Similarly, annotating a single line of code in a biometric verification server changes it to store its sensitive data in Oblivious RAM and makes it resilient against SGX side-channel attacks.

[USENIX Security]   Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution
BibTeX for Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution

Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution

BibTeX
@inproceedings{vanbulck2018foreshadow,
author = {Van Bulck, Jo and Minkin, Marina and Weisse, Ofir and Genkin, Daniel and Kasikci, Baris and
Piessens, Frank and Silberstein, Mark and Wenisch, Thomas F. and Yarom, Yuval and Strackx, Raoul},
title = {Foreshadow: Extracting the Keys to the {Intel SGX} Kingdom with Transient Out-of-Order Execution},
booktitle = {Proceedings of the 27th {USENIX} Security Symposium},
year = {2018},
month = {August},
publisher = {{USENIX} Association},
note={See also technical report Foreshadow-NG~\cite{weisse2018foreshadowNG}}
}
abstract for Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution

Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution

Abstract

Trusted execution environments, and particularly the Software Guard eXtensions (SGX) included in recent Intel x86 processors, gained significant traction in recent years. A long track of research papers, and increasingly also real-world industry applications, take advantage of the strong hardware-enforced confidentiality and integrity guarantees provided by Intel SGX. Ultimately, enclaved execution holds the compelling potential of securely offloading sensitive computations to untrusted remote platforms.

We present Foreshadow, a practical software-only microarchitectural attack that decisively dismantles the security objectives of current SGX implementations. Crucially, unlike previous SGX attacks, we do not make any assumptions on the victim enclave’s code and do not necessarily require kernel-level access. At its core, Foreshadow abuses a speculative execution bug in modern Intel processors, on top of which we develop a novel exploitation methodology to reliably leak plaintext enclave secrets from the CPU cache. We demonstrate our attacks by extracting full cryptographic keys from Intel’s vetted architectural enclaves, and validate their correctness by launching rogue production enclaves and forging arbitrary local and remote attestation responses. The extracted remote attestation keys affect millions of devices.

slides for Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution project for Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution
Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, Raoul Strackx
First Prize in CSAW Regional Competition
[USENIX ATC]   Varys: Protecting SGX enclaves from practical side-channel attacks
BibTeX for Varys: Protecting SGX enclaves from practical side-channel attacks

Varys: Protecting SGX enclaves from practical side-channel attacks

BibTeX
@inproceedings {216033,
author = {Oleksii Oleksenko and Bohdan Trach and Robert Krahn and Mark Silberstein and Christof Fetzer},
title = {Varys: Protecting {SGX} Enclaves from Practical Side-Channel Attacks},
booktitle = {2018 {USENIX} Annual Technical Conference ({USENIX} {ATC} 18)},
year = {2018},
isbn = {978-1-939133-01-4},
address = {Boston, MA},
pages = {227--240},
url = {https://www.usenix.org/conference/atc18/presentation/oleksenko},
publisher = {{USENIX} Association},
month = jul,
}
abstract for Varys: Protecting SGX enclaves from practical side-channel attacks

Varys: Protecting SGX enclaves from practical side-channel attacks

Abstract

Numerous recent works have experimentally shown that Intel Software Guard Extensions (SGX) are vulnerable to cache timing and page table side-channel attacks which could be used to circumvent the data confidentiality guarantees provided by SGX. Existing mechanisms that protect against these attacks either incur high execution costs, are ineffective against certain attack variants, or require significant code modifications.

We present Varys, a system that protects unmodified programs running in SGX enclaves from cache timing and page table side-channel attacks. Varys takes a pragmatic approach of strict reservation of physical cores to security-sensitive threads, thereby preventing the attacker from accessing shared CPU resources during enclave execution. The key challenge that we are addressing is that of maintaining the core reservation in the presence of an untrusted OS.

Varys fully protects against all L1/L2 cache timing attacks and significantly raises the bar for page table side-channel attacks – all with only 15% overhead on average for Phoenix and PARSEC benchmarks. Additionally, we propose a set of minor hardware extensions that hold the potential to extend Varys’ security guarantees to L3 cache and further improve its performance.

slides for Varys: Protecting SGX enclaves from practical side-channel attacks
Oleksii Oleksenko, Bohdan Trach, Robert Krahn, Andre Martin, Mark Silberstein, Christof Fetzer
[EuroSys]   Eleos: Exit-Less OS Services for SGX Enclaves
BibTeX for Eleos: Exit-Less OS Services for  SGX Enclaves

Eleos: Exit-Less OS Services for SGX Enclaves

BibTeX
@inproceedings{Eleos,
author = {Orenbach, Meni and Lifshits, Pavel and Minkin, Marina and Silberstein, Mark},
title = {Eleos: ExitLess OS Services for SGX Enclaves},
year = {2017},
isbn = {9781450349383},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3064176.3064219},
doi = {10.1145/3064176.3064219},
booktitle = {Proceedings of the Twelfth European Conference on Computer Systems},
pages = {238--253},
numpages = {16},
location = {Belgrade, Serbia},
series = {EuroSys ’17}
}



abstract for Eleos: Exit-Less OS Services for  SGX Enclaves

Eleos: Exit-Less OS Services for SGX Enclaves

Abstract

Intel Software Guard eXtensions (SGX) enable secure and trusted execution of user code in an isolated enclave to protect against a powerful adversary. Unfortunately, running I/O-intensive, memory-demanding server applications in enclaves leads to significant performance degradation. Such applications put a substantial load on the in-enclave system call and secure paging mechanisms, which turn out to be the main reason for the application slowdown. In addition to the high direct cost of thousands-of-cycles long SGX management instructions, these mechanisms incur the high indirect cost of enclave exits due to associated TLB flushes and processor state pollution.

We tackle these performance issues in Eleos by enabling exit-less system calls and exit-less paging in enclaves. Eleos introduces a novel Secure User-managed Virtual Memory (SUVM) abstraction that implements application-level paging inside the enclave. SUVM eliminates the overheads of enclave exits due to paging, and enables new optimizations such as sub-page granularity of accesses. We thoroughly evaluate Eleos on a range of microbenchmarks and two real server applications, achieving notable system performance gains. Memcached and a face verification server running in-enclave with Eleos achieve up to 2.2× and 2.3× higher throughput, respectively, while working on datasets up to 5× larger than the enclave’s secure physical memory.

slides for Eleos: Exit-Less OS Services for  SGX Enclaves code for Eleos: Exit-Less OS Services for  SGX Enclaves
Meni Orenbach, Marina Minkin, Tsahi Cohen, Mark Silberstein
[SysTex]   SGX Enclaves as Accelerators
BibTeX for SGX Enclaves as Accelerators

SGX Enclaves as Accelerators

BibTeX
@inproceedings{orenbach16systex,
title={SGX Enclaves as Accelerators},
author={Meni Orenbach and Mark Silberstein},
booktitle={1st Workshop on System Software for Trusted Execution},
year={2016},
}
abstract for SGX Enclaves as Accelerators

SGX Enclaves as Accelerators

Abstract

Intel SGX enclaves are a novel technology that holds the promise to revolutionize the way secure and trustworthy applications are built. However, from the perspective of interaction with the rest of the system, some of the enclave’s characteristics are remarkably similar to the characteristics of traditional hardware accelerators, such as GPUs. For example, enclaves suffer from significant invocation overheads, offer space-constrained private memory, and cannot directly invoke OS services such as network or file I/O. Over the course of GPU computing evolution, many techniques have been developed to improve system performance and programmability. Our key observation is that the conceptual similarities between enclaves and accelerators may help to build efficient runtime support for enclaves by learning from past experience with GPUs.

We demonstrate this simple idea by implementing SGXIO, a simple yet powerful enhancement to the current SGX runtime which boosts the performance of I/O system calls from enclaves. The SGXIO design is almost identical to the design of the GPUfs and GPUnet systems for efficient I/O services for GPU programs. Our preliminary evaluation shows that SGXIO improves the performance of a simple network parameter server for distributed machine learning by up to 3.7×. These promising results suggest new ways to design more efficient runtime and system services for enclaves.

slides for SGX Enclaves as Accelerators

Hardware Side Channels

[EuroSys]   Autarky: Closing controlled channels with self-paging enclaves
BibTeX for Autarky: Closing controlled channels with self-paging enclaves

Autarky: Closing controlled channels with self-paging enclaves

BibTeX
@inproceedings{autarky20eurosys,
author = {Orenbach, Meni and Baumann, Andrew and Silberstein, Mark},
title = {Autarky: Closing controlled channels with self-paging enclaves},
booktitle = {Fifteenth European Conference on Computer Systems, Heraklion, Greece},
series = {EuroSys ’20},
year = {2020},
}
abstract for Autarky: Closing controlled channels with self-paging enclaves

Autarky: Closing controlled channels with self-paging enclaves

Abstract

As the first widely-deployed secure enclave hardware, Intel SGX shows promise as a practical basis for confidential cloud computing. However, side channels remain SGX’s greatest security weakness. In particular, the “controlled-channel attack” on enclave page faults exploits a longstanding architectural side channel and still lacks effective mitigation.

We propose Autarky: a set of minor, backward-compatible modifications to the SGX ISA that hide an enclave’s page access trace from the host, and give the enclave full control over its page faults. A trusted library OS implements enclave self-paging policy.

We prototype Autarky on current SGX hardware and the Graphene library OS, implementing three paging schemes: a fast software oblivious RAM system made practical by leveraging the proposed ISA, a novel page cluster abstraction for application-aware secure self-paging, and a rate-limiting paging mechanism for unmodified binaries. Overall, Autarky provides a comprehensive defense for controlled-channel attacks which supports efficient secure demand paging, and adds no overheads in page-fault free execution.

[USENIX Security]   SpecFuzz: Bringing Spectre-type vulnerabilities to the surface
BibTeX for SpecFuzz: Bringing Spectre-type vulnerabilities to the surface

SpecFuzz: Bringing Spectre-type vulnerabilities to the surface

BibTeX
@Inproceedings{SpeckFuzz20UsenixSec,
author = {Oleksii Oleksenko and
Bohdan Trach and
Mark Silberstein and
Christof Fetzer},
title = {SpecFuzz: Bringing Spectre-type vulnerabilities to the surface},
booktitle = {Proceedings of the 29th USENIX Security Symposium},
series = {USENIX Security '20},
year = {2020}
}
abstract for SpecFuzz: Bringing Spectre-type vulnerabilities to the surface

SpecFuzz: Bringing Spectre-type vulnerabilities to the surface

Abstract

SpecFuzz is the first tool that enables dynamic testing for speculative execution vulnerabilities (e.g., Spectre). The key is a novel concept of speculation exposure: The program is instrumented to simulate speculative execution in software by forcefully executing the code paths that could be triggered due to mispredictions, thereby making the speculative memory accesses visible to integrity checkers (e.g., AddressSanitizer). Combined with the conventional fuzzing techniques, speculation exposure enables more precise identification of potential vulnerabilities compared to state-of-the-art static analyzers.
Our prototype for detecting Spectre V1 vulnerabilities successfully identifies all known variations of Spectre V1 and decreases the mitigation overheads across the evaluated applications, reducing the number of instrumented branches by up to 93% given sufficient test coverage.

code for SpecFuzz: Bringing Spectre-type vulnerabilities to the surface
[MICRO Top Picks]   Breaking Virtual Memory Protection and the SGX Ecosystem with Foreshadow
BibTeX for Breaking Virtual Memory Protection and the SGX Ecosystem with Foreshadow

Breaking Virtual Memory Protection and the SGX Ecosystem with Foreshadow

BibTeX
@ARTICLE{8691527,
author={J. {Van Bulck} and M. {Minkin} and O. {Weisse} and D. {Genkin} and B. {Kasikci} and F. {Piessens} and M. {Silberstein} and T. F. {Wenisch} and Y. {Yarom} and R. {Strackx}},
journal={IEEE Micro},
title={Breaking Virtual Memory Protection and the SGX Ecosystem with Foreshadow},
year={2019},
volume={39},
number={3},
pages={66-74},
keywords={security of data;software architecture;trusted computing;virtual machines;virtualisation;virtual memory protection;SGX ecosystem;foreshadow;speculative execution attack;security guarantees;virtual machines;physical memory;Intel Software Guard eXtensions;Program processors;Ecosystems;Microarchitecture;Kernel;Side-channel attacks},
doi={10.1109/MM.2019.2910104},
ISSN={1937-4143},
month={May},}
abstract for Breaking Virtual Memory Protection and the SGX Ecosystem with Foreshadow

Breaking Virtual Memory Protection and the SGX Ecosystem with Foreshadow

Abstract

Foreshadow is a speculative execution attack that allows adversaries to subvert the security guarantees of Intel’s Software Guard eXtensions (SGX). Foreshadow allows access to data across process boundaries, and allows virtual machines (VMs) to read the physical memory belonging to other VMs or the hypervisor.

project for Breaking Virtual Memory Protection and the SGX Ecosystem with Foreshadow
Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, Raoul Strackx
Selected for publication in IEEE Micro Top Picks
[USENIX Security]   Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution
BibTeX for Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution

Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution

BibTeX
@inproceedings{vanbulck2018foreshadow,
author = {Van Bulck, Jo and Minkin, Marina and Weisse, Ofir and Genkin, Daniel and Kasikci, Baris and
Piessens, Frank and Silberstein, Mark and Wenisch, Thomas F. and Yarom, Yuval and Strackx, Raoul},
title = {Foreshadow: Extracting the Keys to the {Intel SGX} Kingdom with Transient Out-of-Order Execution},
booktitle = {Proceedings of the 27th {USENIX} Security Symposium},
year = {2018},
month = {August},
publisher = {{USENIX} Association},
note={See also technical report Foreshadow-NG~\cite{weisse2018foreshadowNG}}
}
abstract for Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution

Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution

Abstract

Trusted execution environments, and particularly the Software Guard eXtensions (SGX) included in recent Intel x86 processors, gained significant traction in recent years. A long track of research papers, and increasingly also real-world industry applications, take advantage of the strong hardware-enforced confidentiality and integrity guarantees provided by Intel SGX. Ultimately, enclaved execution holds the compelling potential of securely offloading sensitive computations to untrusted remote platforms.

We present Foreshadow, a practical software-only microarchitectural attack that decisively dismantles the security objectives of current SGX implementations. Crucially, unlike previous SGX attacks, we do not make any assumptions on the victim enclave’s code and do not necessarily require kernel-level access. At its core, Foreshadow abuses a speculative execution bug in modern Intel processors, on top of which we develop a novel exploitation methodology to reliably leak plaintext enclave secrets from the CPU cache. We demonstrate our attacks by extracting full cryptographic keys from Intel’s vetted architectural enclaves, and validate their correctness by launching rogue production enclaves and forging arbitrary local and remote attestation responses. The extracted remote attestation keys affect millions of devices.

slides for Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution project for Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution
Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F. Wenisch, Yuval Yarom, Raoul Strackx
First Prize in CSAW Regional Competition
[PETS]   Power to peep-all: Inference Attacks by Malicious Batteries on Mobile Devices
BibTeX for Power to peep-all: Inference Attacks by Malicious Batteries on Mobile Devices

Power to peep-all: Inference Attacks by Malicious Batteries on Mobile Devices

BibTeX
@article{lifshits2018power,
title={Power to peep-all: Inference attacks by malicious batteries on mobile devices},
author={Lifshits, Pavel and Forte, Roni and Hoshen, Yedid and Halpern, Matt and Philipose, Manuel and Tiwari, Mohit and Silberstein, Mark},
journal={Proceedings on Privacy Enhancing Technologies},
volume={2018},
number={4},
pages={141--158},
year={2018},
publisher={Sciendo}
}
abstract for Power to peep-all: Inference Attacks by Malicious Batteries on Mobile Devices

Power to peep-all: Inference Attacks by Malicious Batteries on Mobile Devices

Abstract

Mobile devices are equipped with increasingly smart batteries designed to provide responsiveness and extended lifetime. However, such smart batteries may present a threat to users’ privacy. We demonstrate that the phone’s power trace sampled from the battery at 1 kHz holds enough information to recover a variety of sensitive data.

We show techniques to infer characters typed on a touchscreen; to accurately recover browsing history in an open-world setup; and to reliably detect incoming calls and photo shots, including their lighting conditions. Combined with a novel exfiltration technique that establishes a covert channel from the battery to a remote server via a web browser, these attacks turn the malicious battery into a stealthy surveillance device. We deconstruct the attack by analyzing its robustness to sampling rate and execution conditions. To find mitigations, we identify the sources of the information leakage exploited by the attack. We discover that the GPU or DRAM power traces alone are sufficient to distinguish between different websites. However, the CPU and power-hungry peripherals such as a touchscreen are the primary sources of fine-grain information leakage.

We consider and evaluate possible mitigation mechanisms, highlighting the challenges of defending against the attacks. In summary, our work shows the feasibility of the malicious battery and motivates further research into system and application-level defenses to fully mitigate this emerging threat.

slides for Power to peep-all: Inference Attacks by Malicious Batteries on Mobile Devices video for Power to peep-all: Inference Attacks by Malicious Batteries on Mobile Devices
Pavel Lifshits, Roni Forte, Yedid Hoshen, Matt Halpern, Manuel Philipose, Mohit Tiwari, Mark Silberstein
Third Prize in CSAW Regional Competition
[USENIX ATC]   Varys: Protecting SGX enclaves from practical side-channel attacks
BibTeX for Varys: Protecting SGX enclaves from practical side-channel attacks

Varys: Protecting SGX enclaves from practical side-channel attacks

BibTeX
@inproceedings {216033,
author = {Oleksii Oleksenko and Bohdan Trach and Robert Krahn and Mark Silberstein and Christof Fetzer},
title = {Varys: Protecting {SGX} Enclaves from Practical Side-Channel Attacks},
booktitle = {2018 {USENIX} Annual Technical Conference ({USENIX} {ATC} 18)},
year = {2018},
isbn = {978-1-939133-01-4},
address = {Boston, MA},
pages = {227--240},
url = {https://www.usenix.org/conference/atc18/presentation/oleksenko},
publisher = {{USENIX} Association},
month = jul,
}
abstract for Varys: Protecting SGX enclaves from practical side-channel attacks

Varys: Protecting SGX enclaves from practical side-channel attacks

Abstract

Numerous recent works have experimentally shown that Intel Software Guard Extensions (SGX) are vulnerable to cache timing and page table side-channel attacks which could be used to circumvent the data confidentiality guarantees provided by SGX. Existing mechanisms that protect against these attacks either incur high execution costs, are ineffective against certain attack variants, or require significant code modifications.

We present Varys, a system that protects unmodified programs running in SGX enclaves from cache timing and page table side-channel attacks. Varys takes a pragmatic approach of strict reservation of physical cores to security-sensitive threads, thereby preventing the attacker from accessing shared CPU resources during enclave execution. The key challenge that we are addressing is that of maintaining the core reservation in the presence of an untrusted OS.

Varys fully protects against all L1/L2 cache timing attacks and significantly raises the bar for page table side-channel attacks – all with only 15% overhead on average for Phoenix and PARSEC benchmarks. Additionally, we propose a set of minor hardware extensions that hold the potential to extend Varys’ security guarantees to L3 cache and further improve its performance.

slides for Varys: Protecting SGX enclaves from practical side-channel attacks
Oleksii Oleksenko, Bohdan Trach, Robert Krahn, Andre Martin, Mark Silberstein, Christof Fetzer
[ARXIV]   You shall not bypass: Employing data dependencies to prevent bounds check bypass
BibTeX for You shall not bypass: Employing data dependencies to prevent bounds check bypass

You shall not bypass: Employing data dependencies to prevent bounds check bypass

BibTeX
@article{DBLP:journals/corr/abs-1805-08506,
author = {Oleksii Oleksenko and
Bohdan Trach and
Tobias Reiher and
Mark Silberstein and
Christof Fetzer},
title = {You Shall Not Bypass: Employing data dependencies to prevent Bounds
Check Bypass},
journal = {CoRR},
volume = {abs/1805.08506},
year = {2018},
url = {http://arxiv.org/abs/1805.08506},
archivePrefix = {arXiv},
eprint = {1805.08506},
timestamp = {Mon, 13 Aug 2018 16:48:45 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1805-08506.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
abstract for You shall not bypass: Employing data dependencies to prevent bounds check bypass

You shall not bypass: Employing data dependencies to prevent bounds check bypass

Abstract

A recent discovery of a new class of microarchitectural attacks called Spectre picked up the attention of the security community as these attacks can circumvent many traditional mechanisms of defence. One of the attacks, Bounds Check Bypass, cannot be efficiently solved at either the system or the architectural level and requires changes in the application itself. So far, the proposed mitigations have involved serialization, which reduces the usage of CPU resources and causes high overheads. In this report, we explore methods of delaying the vulnerable instructions without complete serialization. We discuss several ways of achieving it and compare them with Speculative Load Hardening, an existing solution based on a similar idea. The solutions of this type cause 60% overhead across the Phoenix benchmark suite, which compares favourably to the full serialization causing 440% slowdown.

[GPGPU]   Understanding The Security of Discrete GPUs
BibTeX for Understanding The Security of Discrete GPUs

Understanding The Security of Discrete GPUs

BibTeX
@incollection{zhu2017understanding,
title={Understanding the security of discrete GPUs},
author={Zhu, Zhiting and Kim, Sangman and Rozhanski, Yuri and Hu, Yige and Witchel, Emmett and Silberstein, Mark},
booktitle={Proceedings of the General Purpose GPUs},
pages={1--11},
year={2017},
series = {GPGPU '17}
}
abstract for Understanding The Security of Discrete GPUs

Understanding The Security of Discrete GPUs

Abstract

GPUs have become an integral part of modern systems, but their implications for system security are not yet clear. This paper demonstrates both that discrete GPUs cannot be used as secure co-processors and that GPUs provide a stealthy platform for malware. First, we examine a recent proposal to use discrete GPUs as secure co-processors and show that the security guarantees of the proposed system do not hold on the GPUs we investigate. Second, we demonstrate that (under certain circumstances) it is possible to bypass IOMMU protections and create stealthy, long-lived GPU-based malware. We demonstrate a novel attack that compromises the in-kernel GPU driver and one that compromises GPU microcode to gain full access to CPU physical memory. In general, we find that the highly sophisticated but poorly documented GPU hardware architecture, hidden behind obscure closed-source device drivers and vendor-specific APIs, not only makes GPUs a poor choice for applications requiring strong security, but also makes GPUs into a security threat.

Zhiting Zhu, Sangman Kim, Yuri Rozhanski, Yige Hu, Emmett Witchel, Mark Silberstein

GPU computing, Networking, Machine Learning, Distributed Systems

[WACI]   Putting Bugs in Your Data Center Might Actually be a Good Idea
abstract for Putting Bugs in Your Data Center Might Actually be a Good Idea

Putting Bugs in Your Data Center Might Actually be a Good Idea

Abstract

Data centers of cloud providers hold millions of processor cores, exabytes of storage, and petabytes of network bandwidth. Research shows that in 2019, data centers consumed more than 2% of global electricity production, with 50% of that consumption going to cooling infrastructure. While the most effective solution for thermal distribution is liquid cooling, technical challenges and complexities make it expensive. We suggest using living spiders as cooling devices for data centers. Prior work shows that spider silk has high thermal conductivity, close to that of copper, the second-best metallic conductor. Spiders not only generate spider silk but also maintain it. Recruiting spiders for the job requires no more than inserting bugs into the data center for the spiders to catch. This solution is effective, self-sustaining, and environment-friendly, but requires solving a number of non-trivial technical and zoological challenges on the way to making it practical.

Alon Rashelbach, Mark Silberstein
[ARXIV]   Faster Neural Network Training with Approximate Tensor Operations

BibTeX
@article{DBLP:journals/corr/abs-1805-08079,
author = {Menachem Adelman and
Mark Silberstein},
title = {Faster Neural Network Training with Approximate Tensor Operations},
journal = {CoRR},
volume = {abs/1805.08079},
year = {2018},
url = {http://arxiv.org/abs/1805.08079},
archivePrefix = {arXiv},
eprint = {1805.08079},
timestamp = {Mon, 13 Aug 2018 16:48:57 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1805-08079.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

Abstract

We propose a novel technique for faster Neural Network (NN) training by systematically approximating all the constituent matrix multiplications and convolutions. This approach is complementary to other approximation techniques and requires no changes to the dimensions of the network layers, so it is compatible with existing training frameworks. We first analyze the applicability of the existing methods for approximating matrix multiplication to NN training, and extend the most suitable column-row sampling algorithm to approximating multi-channel convolutions. We apply approximate tensor operations to training MLP, CNN and LSTM network architectures on MNIST, CIFAR-100 and Penn Tree Bank datasets and demonstrate a 30%-80% reduction in the amount of computation while maintaining little or no impact on the test accuracy. Our promising results encourage further study of general methods for approximating tensor operations and their application to NN training.
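
The column-row sampling idea mentioned in the abstract can be illustrated with a minimal sketch. This is the classic randomized estimator for a matrix product (sample column-row pairs with probability proportional to their norms, then rescale), written in plain Python for readability; it is not necessarily the exact variant used in the paper, and the function name `approx_matmul` and its parameters are assumptions made for this illustration.

```python
import random


def approx_matmul(A, B, k, rng=random):
    """Approximate the product A @ B from k sampled column-row pairs.

    A is a list of rows (m x n), B is a list of rows (n x p).
    Pair i (column i of A, row i of B) is drawn with probability
    proportional to |A[:, i]| * |B[i, :]| and rescaled so that the
    estimate is unbiased.
    """
    n = len(B)
    rows, cols = len(A), len(B[0])
    C = [[0.0] * cols for _ in range(rows)]

    # Weight of each column-row pair: product of Euclidean norms.
    weights = []
    for i in range(n):
        col_norm = sum(row[i] ** 2 for row in A) ** 0.5
        row_norm = sum(x ** 2 for x in B[i]) ** 0.5
        weights.append(col_norm * row_norm)
    total = sum(weights)
    if total == 0:
        return C  # every pair contributes nothing, the product is zero
    probs = [w / total for w in weights]

    # Draw k pairs with replacement and accumulate rescaled outer products.
    for i in rng.choices(range(n), weights=probs, k=k):
        scale = 1.0 / (k * probs[i])
        for r in range(rows):
            a = A[r][i] * scale
            for c in range(cols):
                C[r][c] += a * B[i][c]
    return C
```

With `k` smaller than the inner dimension `n`, the loop performs `k` outer products instead of `n`, which is where the savings in computation come from; the price is variance in the result, which shrinks as `k` grows.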

Menachem Adelman, Mark Silberstein
[USENIX ATC]   SPIN: Seamless OS integration of Peer-to-Peer DMA between SSDs and GPUs

BibTeX
@inproceedings {203153,
author = {Shai Bergman and Tanya Brokhman and Tzachi Cohen and Mark Silberstein},
title = {{SPIN}: Seamless Operating System Integration of Peer-to-Peer {DMA} Between SSDs and GPUs},
booktitle = {2017 {USENIX} Annual Technical Conference ({USENIX} {ATC} 17)},
year = {2017},
isbn = {978-1-931971-38-6},
address = {Santa Clara, CA},
pages = {167--179},
url = {https://www.usenix.org/conference/atc17/technical-sessions/presentation/bergman},
publisher = {{USENIX} Association},
month = jul,
}

Abstract

Recent GPUs enable Peer-to-Peer Direct Memory Access (P2P) from fast peripheral devices like NVMe SSDs to exclude the CPU from the data path between them for efficiency. Unfortunately, using P2P to access files is challenging because of the subtleties of low-level nonstandard interfaces, which bypass the OS file I/O layers and may hurt system performance.

SPIN integrates P2P into the standard OS file I/O stack, dynamically activating P2P where appropriate, transparently to the user. It combines P2P with page cache accesses, re-enables read-ahead for sequential reads, all while maintaining standard POSIX FS consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID.

We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces and end-to-end experiments. SPIN achieves significant performance speedups across a wide range of workloads, exceeding P2P throughput by up to an order of magnitude. It also boosts the performance of an aerial imagery rendering application by 2.6× by dynamically adapting to its input-dependent file access pattern, and enables 3.3× higher throughput for a GPU-accelerated log server.

Shai Bergman, Tanya Brokhman, Tzachi Cohen, Mark Silberstein
[GPGPU]   Understanding The Security of Discrete GPUs

BibTeX
@incollection{zhu2017understanding,
title={Understanding the security of discrete GPUs},
author={Zhu, Zhiting and Kim, Sangman and Rozhanski, Yuri and Hu, Yige and Witchel, Emmett and Silberstein, Mark},
booktitle={Proceedings of the General Purpose GPUs},
pages={1--11},
year={2017},
series = {GPGPU '17}
}

Abstract

GPUs have become an integral part of modern systems, but their implications for system security are not yet clear. This paper demonstrates both that discrete GPUs cannot be used as secure
co-processors and that GPUs provide a stealthy platform for malware. First, we examine a recent proposal to use discrete GPUs as secure co-processors and show that the security guarantees of
the proposed system do not hold on the GPUs we investigate. Second, we demonstrate that (under certain circumstances) it is possible to bypass IOMMU protections and create stealthy, long-lived
GPU-based malware. We demonstrate a novel attack that compromises the in-kernel GPU driver and one that compromises GPU microcode to gain full access to CPU physical memory. In general,
we find that the highly sophisticated, but poorly documented GPU hardware architecture, hidden behind obscure closed-source device drivers and vendor-specific APIs, not only makes GPUs a poor
choice for applications requiring strong security, but also makes GPUs into a security threat.

Zhiting Zhu, Sangman Kim, Yuri Rozhanski, Yige Hu, Emmett Witchel, Mark Silberstein
[EuroCrypt'17]   Computational integrity with a public random string from quasi-linear PCPs

BibTeX
@inproceedings{ben2017computational,
title={Computational integrity with a public random string from quasi-linear PCPs},
author={Ben-Sasson, Eli and Bentov, Iddo and Chiesa, Alessandro and Gabizon, Ariel and Genkin, Daniel and Hamilis, Matan and Pergament, Evgenya and Riabzev, Michael and Silberstein, Mark and Tromer, Eran and Virza, Madars},
booktitle={Annual International Conference on the Theory and Applications of Cryptographic Techniques},
pages={551--579},
year={2017},
organization={Springer}
}

Abstract

A party running a computation remotely may benefit from misreporting its output, say, to lower its tax. Cryptographic protocols that detect and prevent such falsities hold the promise to enhance the security of decentralized systems with stringent computational integrity requirements, like Bitcoin [Nak09]. To gain public trust it is imperative to use publicly verifiable protocols that have no “backdoors” and which can be set up using only a short public random string. Probabilistically Checkable Proof (PCP) systems [BFL90, BFLS91, AS98, ALM+98] can be used to construct astonishingly efficient protocols [Kil92, Mic00] of this nature, but some of the main components of such systems, proof composition [AS98] and low-degree testing via PCPs of Proximity (PCPPs) [BGH+05, DR06], have been considered efficient only asymptotically, for unrealistically large computations; recent cryptographic alternatives [PGHR13, BCG+13a] suffer from a non-public setup phase. This work introduces SCI, the first implementation of a scalable PCP system (that uses both PCPPs and proof composition). We used SCI to prove correctness of executions of up to 2^20 cycles of a simple processor and calculated its break-even point [SVP+12, SMBW12]. The significance of our findings is two-fold: (i) it marks the transition of core PCP techniques (like proof composition and PCPs of Proximity) from mathematical theory to practical system engineering, and (ii) the thresholds obtained are nearly achievable and hence show that PCP-supported computational integrity is closer to reality than previously assumed.

Eli Ben-Sasson, Iddo Bentov, Alessandro Chiesa, Ariel Gabizon, Daniel Genkin, Matan Hamilis, Evgenya Pergament, Michael Riabzev, Mark Silberstein, Eran Tromer, Madars Virza
Annual International Conference on the Theory and Applications of Cryptographic Techniques
[ICS]   Fast Multiplication in Binary Fields on GPUs via Register Cache

BibTeX
@inproceedings{gpufft16ics,
author = {Ben-Sasson, Eli and Hamilis, Matan and Silberstein, Mark and Tromer, Eran},
title = {Fast Multiplication in Binary Fields on GPUs via Register Cache},
year = {2016},
isbn = {9781450343619},
publisher = {ACM},
url = {https://doi.org/10.1145/2925426.2926259},
doi = {10.1145/2925426.2926259},
booktitle = {Proceedings of the 2016 International Conference on Supercomputing},
articleno = {Article 35},
numpages = {12},
keywords = {Finite Field Multiplication, GPGPU, SIMD, Parallel Algorithms, GPU Code Optimization},
location = {Istanbul, Turkey},
series = {ICS ’16}
}




Abstract

Finite fields of characteristic 2 (“binary fields”) are used in a variety of applications in cryptography and data storage. Multiplication of two finite field elements is a fundamental operation and a well-known computational bottleneck in many of these applications, as they often require multiplication of a large number of elements. In this work we focus on accelerating multiplication in “large” binary fields of sizes greater than 2^32. We devise a new parallel algorithm optimized for execution on GPUs. This algorithm makes it possible to multiply a large number of finite field elements and achieves high performance via bit-slicing and fine-grained parallelization.

The key to the efficient implementation of the algorithm is a novel performance optimization methodology we call the register cache. This methodology speeds up an algorithm that caches its input in shared memory by transforming the code to use per-thread registers instead. We show how to replace shared memory accesses with the shuffle() intra-warp communication instruction, thereby significantly reducing or even eliminating shared memory accesses. We thoroughly analyze the register cache approach and characterize its benefits and limitations.

We apply the register cache methodology to the implementation of the binary finite field multiplication algorithm on GPUs. We achieve up to 138x speedup for fields of size 2^32 over the popular, highly optimized Number Theory Library (NTL) [26], which uses the specialized CLMUL CPU instruction, and over 30x for larger fields of size below 2^256. Our register cache implementation enables up to 50% higher performance compared to the traditional shared-memory based design.
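
For readers unfamiliar with the operation being accelerated, the following minimal sketch shows what multiplication in a binary field means: a carry-less (XOR-based) polynomial product reduced modulo an irreducible polynomial. It is a scalar shift-and-XOR reference in Python, not the bit-sliced GPU algorithm or the register cache technique from the paper; the function name `gf2_mul` and the choice of the small field GF(2^8) with the AES polynomial in the example are assumptions made for illustration.

```python
def gf2_mul(a, b, modulus, degree):
    """Multiply a and b in GF(2^degree).

    Elements are integers whose bits are polynomial coefficients over GF(2).
    `modulus` is the irreducible polynomial of the field, including its
    leading x^degree term. Both a and b must be smaller than 2**degree.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2) is XOR
        b >>= 1
        a <<= 1                  # multiply a by x
        if (a >> degree) & 1:    # degree reached: reduce modulo the polynomial
            a ^= modulus
    return result


# Example in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B).
AES_MODULUS = 0x11B
product = gf2_mul(0x57, 0x83, AES_MODULUS, 8)
print(hex(product))  # 0xc1, the worked example from the AES specification
```

Each multiplication costs a number of shift and XOR steps proportional to the field degree, which is why applications that multiply many elements in large fields benefit from the data-parallel formulation described in the abstract.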




Matan Hamilis, Eli Ben-Sasson, Eran Tromer, Mark Silberstein
Also published in NVIDIA Developers Blog
[EuroSys]   Optimizing Distributed Actor Systems for Dynamic Interactive Services

BibTeX
@inproceedings{actop16eurosys,
author = {Newell, Andrew and Kliot, Gabriel and Menache, Ishai and Gopalan, Aditya and Akiyama, Soramichi and Silberstein, Mark},
title = {Optimizing Distributed Actor Systems for Dynamic Interactive Services},
year = {2016},
isbn = {9781450342407},
publisher = {ACM},
url = {https://doi.org/10.1145/2901318.2901343},
doi = {10.1145/2901318.2901343},
booktitle = {Proceedings of the Eleventh European Conference on Computer Systems},
articleno = {Article 38},
numpages = {15},
location = {London, United Kingdom},
series = {EuroSys ’16}
}

Abstract

Distributed actor systems are widely used for developing interactive scalable cloud services, such as social networks and online games. By modelling an application as a dynamic set of lightweight communicating “actors”, developers can easily build complex distributed applications, while the underlying runtime system deals with low-level complexities of a distributed environment.

We present ActOp—a data-driven, application-independent runtime mechanism for optimizing end-to-end service latency of actor-based distributed applications. ActOp targets the two dominant factors affecting latency: the overhead of remote inter-actor communications across servers, and the intra-server queuing delay. ActOp automatically identifies frequently communicating actors and migrates them to the same server transparently to the running application. The migration decisions are driven by a novel scalable distributed graph partitioning algorithm which does not rely on a single server to store the whole communication graph, thereby enabling efficient actor placement even for applications with rapidly changing graphs (e.g., chat services). Further, each server autonomously reduces the queuing delay by learning an internal queuing model and configuring threads according to instantaneous request rate and application demands.

We prototype ActOp by integrating it with Orleans — a popular open-source actor system [4, 13]. Experiments with realistic workloads show latency improvements of up to 75% for the 99th percentile, up to 63% for the mean, with up to 2x increase in peak system throughput.
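
The co-location idea in the abstract (moving frequently communicating actors onto the same server to cut remote messages) can be sketched as follows. This is a deliberately simple, centralized greedy heuristic for illustration only; it is not ActOp's distributed graph partitioning algorithm, it ignores server load balance, and the names `remote_messages` and `best_server` as well as the data layout are hypothetical.

```python
from collections import defaultdict


def remote_messages(edges, placement):
    """Total message count between actors placed on different servers.

    edges maps an (actor, actor) pair to its observed message count;
    placement maps each actor to its current server.
    """
    return sum(w for (a, b), w in edges.items() if placement[a] != placement[b])


def best_server(actor, edges, placement):
    """Server hosting the peers that `actor` exchanges the most messages with."""
    traffic = defaultdict(int)
    for (a, b), w in edges.items():
        if a == actor:
            traffic[placement[b]] += w
        elif b == actor:
            traffic[placement[a]] += w
    # An actor with no recorded traffic stays where it is.
    return max(traffic, key=traffic.get) if traffic else placement[actor]


# Hypothetical communication graph: "a" talks mostly to "b".
edges = {("a", "b"): 10, ("a", "c"): 1, ("b", "c"): 1}
placement = {"a": "s1", "b": "s2", "c": "s1"}

before = remote_messages(edges, placement)
placement["a"] = best_server("a", edges, placement)  # migrate "a" next to "b"
after = remote_messages(edges, placement)
print(before, after)  # 11 2
```

A practical system also has to bound how many actors land on one server and has to work without any single node holding the whole graph, which is the part the paper's partitioning algorithm addresses.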

Andrew Newell, Gabriel Kliot, Ishai Menache, Aditya Gopalan, Soramichi Akiyama, Mark Silberstein
[SYSTOR]   Lazy Means Smart: Reducing Repair Bandwidth Costs in Erasure-coded Distributed Storage

BibTeX
@inproceedings{erasurecoding14systor,
author = {Silberstein, Mark and Ganesh, Lakshmi and Wang, Yang and Alvisi, Lorenzo and Dahlin, Mike},
title = {Lazy Means Smart: Reducing Repair Bandwidth Costs in Erasure-Coded Distributed Storage},
year = {2014},
publisher = {ACM},
url = {https://doi.org/10.1145/2611354.2611370},
doi = {10.1145/2611354.2611370},
booktitle = {Proceedings of International Conference on Systems and Storage},
pages = {1--7},
numpages = {7},
keywords = {Distributed storage systems, Erasure codes, Repair bandwidth},
location = {Haifa, Israel},
series = {SYSTOR 2014}
}


Abstract

Erasure coding schemes provide higher durability at lower storage cost, and thus constitute an attractive alternative to replication in distributed storage systems, in particular for storing rarely accessed “cold” data. These schemes, however, require an order of magnitude higher recovery bandwidth for maintaining a constant level of durability in the face of node failures. In this paper we propose lazy recovery, a technique to reduce recovery bandwidth demands down to the level of replicated storage. The key insight is that a careful adjustment of recovery rate substantially reduces recovery bandwidth, while keeping the impact on read performance and data durability low. We demonstrate the benefits of lazy recovery via extensive simulation using a realistic distributed storage configuration and published component failure parameters. For example, when applied to the commonly used RS(14, 10) code, lazy recovery reduces repair bandwidth by up to 76% even below replication, while increasing the amount of degraded stripes by 0.1 percentage points. Lazy recovery works well with a variety of erasure coding schemes, including the recently introduced bandwidth efficient codes, achieving up to a factor of 2 additional bandwidth savings.
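
The intuition behind delaying repairs can be shown with a back-of-the-envelope model. It relies on one standard property of Reed-Solomon codes: rebuilding lost chunks of a stripe requires reading k surviving chunks, and a single such read can rebuild all chunks that are currently missing. Everything else here is a simplifying assumption for illustration (one stripe, sequential permanent failures, a fixed repair threshold); the function name `chunks_read` and its parameters are hypothetical and do not reproduce the paper's simulator or its recovery-rate policy.

```python
def chunks_read(num_failures, k, threshold=1):
    """Chunks read from surviving nodes to handle sequential chunk losses.

    A repair is triggered only once `threshold` chunks of the stripe are
    missing. Each repair reads k chunks and rebuilds every missing chunk.
    threshold=1 models eager repair; larger values model lazy repair and
    must not exceed the number of parity chunks, or data would be lost.
    """
    missing = 0
    total_read = 0
    for _ in range(num_failures):
        missing += 1
        if missing >= threshold:
            total_read += k   # one read of k chunks repairs all missing chunks
            missing = 0
    return total_read


# RS(14, 10): k = 10 data chunks, 4 parity chunks, 12 chunk losses over time.
eager = chunks_read(12, k=10, threshold=1)
lazy = chunks_read(12, k=10, threshold=2)
print(eager, lazy)  # 120 60
```

In this toy model, waiting for two missing chunks halves the repair reads, at the cost of stripes spending more time in a degraded state, which is the trade-off the abstract quantifies.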


Mark Silberstein, Lakshmi Ganesh, Yang Wang, Lorenzo Alvisi, Mike Dahlin
Best Paper Award
[ACM UBIQUITY]   GPUs: High-performance Accelerators for Parallel Applications.

BibTeX
@article{uniquity,
author = {Silberstein, Mark},
title = {GPUs: High-Performance Accelerators for Parallel Applications: The Multicore Transformation (Ubiquity Symposium)},
year = {2014},
issue_date = {August 2014},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {2014},
number = {August},
url = {https://doi.org/10.1145/2618401},
doi = {10.1145/2618401},
journal = {Ubiquity},
month = aug,
articleno = {Article 1},
numpages = {13}
}




Abstract

Early graphics processing units (GPUs) were designed as high compute density, fixed-function processors ideally crafted to the needs of computer graphics workloads. Today, GPUs are becoming truly first-class computing elements on par with CPUs. Programming GPUs as self-sufficient general-purpose processors is not only hypothetically desirable, but feasible and efficient in practice, opening new opportunities for integration of GPUs in complex software systems.




Mark Silberstein
Invited to Ubiquity Symposium on Parallel Computing