ActivePointers: A Case for Software Address Translation on GPUs

ActivePointers enable virtual address space management and page fault handling for GPUs. In particular, they make it possible to mmap files into the GPU’s virtual address space. This significantly simplifies application development on GPUs: mmap eliminates buffer allocation, read/write system calls, and file pointer arithmetic, and it enables seamless serialization and deserialization of in-memory data structures to and from files.
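
To make concrete what mmap removes from application code, here is a minimal sketch using the ordinary POSIX mmap call on the CPU. It is not the ActivePointers API, and the function name sum_file_bytes is hypothetical. The point is that the file is read through a plain pointer, with no staging buffer, no read() loop, and no manual offset bookkeeping.

```cpp
#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Sums all bytes of a file through a read-only mapping.
// Returns -1 if the file cannot be opened or mapped.
long long sum_file_bytes(const char* path) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects zero-length mappings

    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (base == MAP_FAILED) return -1;

    // The file contents are now accessed like an in-memory array.
    const std::uint8_t* bytes = static_cast<const std::uint8_t*>(base);
    long long sum = 0;
    for (off_t i = 0; i < st.st_size; ++i) sum += bytes[i];

    munmap(base, st.st_size);
    return sum;
}
```

The equivalent code based on read() would need an explicitly allocated buffer and a loop that tracks the current file offset.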

System Overview

This work introduces a GPU-centric, application-level virtual memory (VM) management design. Since GPU VM manipulation is highly restricted, ActivePointers aims to overcome the limitations of the GPU’s hardware VM entirely in software.

In the GPU-centric VM management design using ActivePointers, the address translation layer triggers a page fault, which is handled on the GPU itself (rather than on the CPU, as in a CPU-centric VM management design). The data is copied from the backing store into the GPU I/O page cache, and the GPU then updates the I/O page table. No accesses to the hardware VM page table are necessary.
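
The fault path described above can be sketched as follows. This is a simplified, single-threaded model that runs on the host, not the actual GPU implementation. The type SoftwareVM, the 4 KB page size, and the unbounded page cache (no eviction) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

constexpr std::size_t kPageSize = 4096;  // assumed page size

// Hypothetical model of a software-managed page cache and page table.
struct SoftwareVM {
    const std::vector<std::uint8_t>& backing;  // stands in for the mapped file
    std::vector<std::uint8_t> cache;           // page cache frames, grown on demand
    std::unordered_map<std::uint64_t, std::size_t> page_table;  // page number -> frame
    std::size_t faults = 0;

    explicit SoftwareVM(const std::vector<std::uint8_t>& file) : backing(file) {}

    // Fault handler: copy the page from the backing store into a new frame,
    // then publish the mapping in the software page table.
    std::size_t handle_fault(std::uint64_t page) {
        ++faults;
        std::size_t frame = cache.size() / kPageSize;
        cache.resize(cache.size() + kPageSize, 0);
        std::size_t off = page * kPageSize;
        std::size_t n = std::min(kPageSize, backing.size() - off);
        std::memcpy(&cache[frame * kPageSize], &backing[off], n);
        page_table[page] = frame;
        return frame;
    }

    // Translated read: consult the software page table, fault on a miss.
    std::uint8_t read(std::uint64_t addr) {
        assert(addr < backing.size());
        std::uint64_t page = addr / kPageSize;
        auto it = page_table.find(page);
        std::size_t frame = (it != page_table.end()) ? it->second : handle_fault(page);
        return cache[frame * kPageSize + addr % kPageSize];
    }
};
```

Note that nothing in this path touches a hardware page table: both the lookup and the update operate on data structures owned by the application.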

Design

The key building block of our design is a new type of memory pointer called an active pointer, or apointer. Apointers are used to access avirtual memory and behave just like regular memory pointers: they support all the standard pointer operations, such as dereferencing and pointer arithmetic, and can be passed as function parameters or returned from functions. Under the hood, apointers trigger page faults, monitor memory protection, and help track active pages.

A linked apointer holds a valid avirtual-to-aphysical mapping, i.e., it holds a reference to an active page. Dereferencing a linked apointer guarantees page-fault-free data access and requires no translation lookup. The system is designed to store the aphysical address of a page in the apointer itself, that is, in the hardware register that holds the value of the apointer when accessing memory.
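
As a rough illustration of the idea, the following host-side sketch shows a pointer-like class that links itself on first dereference and stays linked under pointer arithmetic within the same page. The names APtr and PagePool are hypothetical, the page size is assumed, and reference counting of active pages and memory protection are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr std::size_t kPageBytes = 4096;  // assumed page size

// Hypothetical page pool: maps an avirtual page number to a resident page.
struct PagePool {
    std::unordered_map<std::uint64_t, std::vector<std::uint8_t>> pages;
    std::size_t faults = 0;

    std::uint8_t* fault_in(std::uint64_t page) {
        auto it = pages.find(page);
        if (it == pages.end()) {
            ++faults;  // a real system would fetch the page from the backing store
            it = pages.emplace(page, std::vector<std::uint8_t>(kPageBytes, 0)).first;
        }
        return it->second.data();
    }
};

template <typename T>
class APtr {
    static_assert(kPageBytes % sizeof(T) == 0, "elements must not straddle pages");
    PagePool* pool_;
    std::uint64_t vaddr_;            // avirtual byte address
    std::uint8_t* frame_ = nullptr;  // aphysical page base; non-null means linked

public:
    APtr(PagePool& pool, std::uint64_t vaddr) : pool_(&pool), vaddr_(vaddr) {
        assert(vaddr % alignof(T) == 0);
    }

    bool linked() const { return frame_ != nullptr; }

    // Dereference: a linked pointer needs no lookup; an unlinked one faults first.
    T& operator*() {
        if (!linked()) frame_ = pool_->fault_in(vaddr_ / kPageBytes);
        return *reinterpret_cast<T*>(frame_ + vaddr_ % kPageBytes);
    }

    // Arithmetic keeps the link only while the result stays on the same page.
    APtr operator+(std::ptrdiff_t n) const {
        APtr r(*pool_, vaddr_ + static_cast<std::uint64_t>(n) * sizeof(T));
        if (linked() && r.vaddr_ / kPageBytes == vaddr_ / kPageBytes) r.frame_ = frame_;
        return r;
    }
    APtr& operator+=(std::ptrdiff_t n) { return *this = *this + n; }
    T& operator[](std::ptrdiff_t n) { return *(*this + n); }
};
```

With this sketch, code such as `APtr<int> p(pool, 0); *p = 42; p[3] = 7;` triggers a single fault, on the first dereference.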

Under The Hood

An apointer is implemented as a C++ class. It comprises two parts: a translation field, which stores the avirtual-to-aphysical mapping and is hence used in every access through an apointer, and metadata fields, which are used only during page faults. The translation field is specifically designed to fit into 64 bits. This allows the compiler to keep it in a hardware register, which is crucial for reducing the overhead of fault-free accesses.
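
One way to picture a 64-bit translation field is as a packed word holding an address plus a few state bits. The layout below (one "linked" bit, one "writable" bit, and a 62-bit address) is purely an assumption for illustration and is not the layout used by the actual system.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 64-bit translation word:
//   bit 63      linked flag (address field is aphysical when set)
//   bit 62      writable flag
//   bits 61..0  address (avirtual when unlinked, aphysical when linked)
struct Translation {
    std::uint64_t bits;

    static constexpr std::uint64_t kLinked   = 1ull << 63;
    static constexpr std::uint64_t kWritable = 1ull << 62;
    static constexpr std::uint64_t kAddrMask = (1ull << 62) - 1;

    static Translation make(std::uint64_t addr, bool linked, bool writable) {
        assert((addr & ~kAddrMask) == 0 && "address must fit in 62 bits");
        return Translation{addr | (linked ? kLinked : 0) | (writable ? kWritable : 0)};
    }

    bool linked() const { return (bits & kLinked) != 0; }
    bool writable() const { return (bits & kWritable) != 0; }
    std::uint64_t address() const { return bits & kAddrMask; }
};

// The whole translation state fits in one 64-bit register.
static_assert(sizeof(Translation) == sizeof(std::uint64_t),
              "translation field must be exactly 64 bits");
```

Because everything needed for a fault-free access lives in a single word, the fast path is a flag test followed by an ordinary load or store.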

Address translation requests are aggregated, and our mechanism enables efficient access through an apointer in both the page-fault-free and the page-fault cases. If none of the threads encounters a page fault, they all quickly return the data without divergence. Otherwise, each subgroup of threads accessing the same page selects one of its threads as a leader, which handles the page fault for that subgroup.
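
The grouping step can be sketched as a host-side simulation of one group of 32 threads. The choice of the lowest-numbered thread as leader, the group size of 32, and the function names are assumptions for illustration. On a GPU this selection would be done with warp-level communication rather than a loop.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <unordered_set>

constexpr int kWarpSize = 32;  // assumed group size
constexpr int kNoFault = -1;   // marks a lane whose page is already resident

// For every lane, returns the lane that will handle its page fault,
// or kNoFault if the lane's page is resident. Lanes that fault on the
// same page share one leader: the lowest-numbered lane among them.
std::array<int, kWarpSize> elect_leaders(
    const std::array<std::uint64_t, kWarpSize>& page_of_lane,
    const std::unordered_set<std::uint64_t>& resident_pages) {
    std::array<int, kWarpSize> leader;
    leader.fill(kNoFault);
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (resident_pages.count(page_of_lane[lane])) continue;  // no fault
        for (int other = 0; other <= lane; ++other) {
            if (page_of_lane[other] == page_of_lane[lane]) {
                leader[lane] = other;  // first lane touching this page leads
                break;
            }
        }
    }
    return leader;
}

// Fast path check: true when no lane faulted, so no divergence is needed.
bool warp_is_fault_free(const std::array<int, kWarpSize>& leader) {
    for (int l : leader) {
        if (l != kNoFault) return false;
    }
    return true;
}
```

Only lanes for which `leader[lane] == lane` would actually run the fault handler. The remaining faulting lanes wait for their leader and then reuse the resulting mapping.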

We also implement a TLB, which is direct-mapped for simplicity. The TLB is implemented as a simple concurrent hash table in the per-threadblock scratchpad memory, enabling lock-free lookups and locked modifications.
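
A direct-mapped table with lock-free lookups and locked updates might look like the host-side sketch below. It uses ordinary memory rather than GPU scratchpad memory, and the class name, the entry encoding (a valid bit, a 31-bit page number, and a 32-bit frame number packed into one atomic word), and the use of std::mutex are all assumptions for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

template <std::size_t N>
class DirectMappedTlb {
    static constexpr std::uint64_t kValid = 1ull << 63;

    // One word per slot: valid bit | page number (31 bits) | frame (32 bits).
    // Packing tag and frame into one atomic lets readers see a consistent pair.
    std::atomic<std::uint64_t> entries_[N];
    std::mutex write_lock_;

    static std::uint64_t tag_of(std::uint32_t page) {
        return kValid | (static_cast<std::uint64_t>(page) << 32);
    }

public:
    DirectMappedTlb() {
        for (auto& e : entries_) e.store(0, std::memory_order_relaxed);
    }

    // Lock-free lookup: each page maps to exactly one slot (page % N).
    bool lookup(std::uint32_t page, std::uint32_t& frame_out) const {
        std::uint64_t e = entries_[page % N].load(std::memory_order_acquire);
        if ((e & ~0xFFFFFFFFull) != tag_of(page)) return false;  // empty or other page
        frame_out = static_cast<std::uint32_t>(e);
        return true;
    }

    // Locked update: overwrites whatever previously occupied the slot.
    void insert(std::uint32_t page, std::uint32_t frame) {
        assert(page < (1u << 31) && "page number must fit in 31 bits");
        std::lock_guard<std::mutex> guard(write_lock_);
        entries_[page % N].store(tag_of(page) | frame, std::memory_order_release);
    }
};
```

The direct-mapped organization trades hit rate for simplicity: two pages that map to the same slot evict each other, but a lookup is a single load and compare.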

Evaluation

While apointer dereferencing and arithmetic operations add computation to each access, their overhead becomes largely hidden as GPU occupancy grows. This latency-hiding ability of the GPU architecture is the key to providing efficient address translation in software.

In an end-to-end image collage application (which substitutes blocks in the input image with “similar” images from a large dataset, where similarity is defined as the Euclidean distance between image evaluation), ActivePointers does not exhibit any measurable overhead over GPUfs, while enabling convenient mmap operations that map the whole image dataset into GPU memory.