SPIN: OS integration of Peer-to-peer DMA between GPUs and SSDs

SPIN integrates peer-to-peer (P2P) data transfers between GPUs and NVMe devices into the standard OS file I/O stack, activating P2P dynamically where appropriate, transparently to the user. It combines P2P with page cache accesses and re-enables read-ahead for sequential reads, all while maintaining standard POSIX file system consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID.

System Overview

GPU vendors expose P2P support so that data can be transferred from a DMA-capable device straight into the GPU's internal memory. SPIN moves data between an SSD and a GPU over PCI Express alone, with no intermediate stop in system memory. By itself, however, raw P2P does not solve a number of performance-limiting issues, such as reads that could be served faster from a cached copy in RAM.

[Figure: SPIN architecture — P-router inspects I/O requests (1) and serves them via P2P (2.a) or the page cache (2.b); GPU RA denotes the GPU read-ahead partition of the page cache.]
SPIN integrates P2P into the OS file I/O layer, using it as a low-level mechanism for optimizing file I/O where applicable. SPIN sits on top of the Virtual File System (VFS) layer. The user allocates the destination buffer in GPU memory and passes a pointer to that buffer to pread. The SPIN core is implemented in P-router.
P-router inspects every I/O request (1 in the Figure) and detects the requests that operate on GPU memory buffers and are amenable to P2P.
P-router invokes the P-readahead mechanism, which identifies sequential access patterns and prefetches file contents into a GPU read-ahead partition (GPU RA in the Figure) of the CPU page cache.
It also checks with P-cache whether the request can be served from the page cache, and creates an I/O schedule that interleaves P2P and page cache accesses.
Finally, it generates VFS I/O requests that are served by a combination of the page cache (2.b) and P2P (2.a). To invoke P2P via the direct disk I/O interface, P-router employs an address tunneling mechanism.

Under The Hood

SPIN uses an address tunneling mechanism to enable P2P data transfers without intrusive changes to the Linux kernel. The main obstacle is that GPU memory is not described by struct page entries, so the kernel cannot set up a direct transfer from the SSD to the GPU. Address tunneling circumvents this by using special phony host buffers to envelope GPU buffer addresses.

SPIN decides how to transfer SSD data that partially resides in the page cache using a greedy heuristic that approximates the overall data transfer time; it models SSD data transfer latency as a piece-wise linear function and fits the model parameters to the hardware.

[Figure: end-to-end throughput of SPIN vs. pure P2P and CPU memory copies, across page cache residency levels.]
SPIN outperforms both pure P2P DMA transfers and memory copies by the CPU by up to 20%, across different levels of page cache residency. This is because it combines the page cache and P2P, dynamically choosing between them per request depending on page cache residency.

[Figure: speedups on a GPU-accelerated log server workload.]
In real-world applications, such as a GPU-accelerated log server, SPIN achieves significant speedups over the alternative data transfer methods.