NICA: Inline Near-Data Processing on SmartNICs

Illustration: Inline Processing - Filtering

Illustration: Inline Processing - Transformation

To accelerate networking tasks and allow customization, NIC vendors offer SmartNICs (programmable network interface cards) as a way to offload more of the CPU's networking tasks to the NIC. SmartNICs, and FPGA-based SmartNICs in particular, are commonly used by cloud providers to accelerate their software-defined networks.

However, inline acceleration can also be used by server applications to offload common networking tasks to the NIC. For example, SmartNICs can filter packets coming in from the network, reducing the CPU load, and they can transform packets into formats more suitable for CPU consumption.

More concretely, SmartNICs can cache popular key-value pairs of a key-value store and reduce the CPU load by responding to requests that hit the cache directly from the NIC. They can also validate a cryptographic signature on incoming requests and filter out invalid requests immediately, improving performance during denial-of-service attacks.
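The caching case can be pictured with a small software model. This is only a sketch under stated assumptions, not NICA's AFU: the names `NicCacheModel`, `Action`, and `Decision` are hypothetical, and a real AFU would implement the lookup in hardware. It shows the per-request decision: answer from the NIC on a hit, pass the request to the host on a miss.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical software model of the per-request decision made inline on the NIC.
enum class Action { ReplyFromNic, PassToHost };

struct Decision {
    Action action;
    std::string reply;  // meaningful only when action == ReplyFromNic
};

class NicCacheModel {
public:
    // The host populates the cache with popular key-value pairs.
    void insert(const std::string& key, const std::string& value) {
        cache_[key] = value;
    }

    // The host invalidates an entry when the value changes.
    void invalidate(const std::string& key) { cache_.erase(key); }

    // A GET that hits is answered from the NIC; a miss goes to the CPU.
    Decision on_get(const std::string& key) const {
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            return {Action::ReplyFromNic, it->second};
        }
        return {Action::PassToHost, std::string()};
    }

private:
    std::unordered_map<std::string, std::string> cache_;
};
```

Only misses reach the host in this model, which is where the CPU savings would come from when a few popular keys dominate the workload.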

Inline processing OS abstractions

Illustration: The ikernel abstraction

NICA provides abstractions to simplify the use of such accelerators by server applications. It uses a new abstraction, the ikernel, to instantiate and control SmartNIC AFUs (accelerator functional units). Applications may attach an ikernel to a POSIX socket, causing the socket’s traffic to be processed by the appropriate AFU on the SmartNIC.

ikernels can be easily used in existing applications. Here is a code example:

// Create an ikernel handle and configure it
k = ik_create(MEMCACHED_AFU);
ik_command(k, CONFIGURE, ...);
// Initialize a socket
s = socket(...);
bind(s, ...);
// Attach the ikernel to the socket to activate it
ik_attach(k, s);
// Use POSIX APIs to receive
while (recvmsg(s, buf, ...))
  ...

Virtualization support

Illustration: NICA Virtualization & Performance Isolation

The NICA framework provides fine-grained virtualization support for AFUs, allowing a single AFU to be shared among multiple cloud tenants. Cloud providers can thus share SmartNIC resources among several tenants, improving utilization.

By tagging the I/O interfaces of the AFU, NICA is able to isolate the AFU's state between tenants, and by providing schedulers for the I/O interfaces, it allows cloud providers to guarantee performance isolation among tenants.
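The two mechanisms, tagging and scheduling, can be sketched in software as follows. This is an illustrative model only: `TaggedAfuModel`, `tenant_tag`, and the plain round-robin policy are assumptions made for the example, not NICA's hardware scheduler. The tag selects a per-tenant queue and a per-tenant slice of state, and the scheduler picks which queue is served next.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct Packet {
    std::uint32_t tenant_tag;  // hypothetical tag carried on the AFU's I/O interface
    std::uint32_t payload;
};

class TaggedAfuModel {
public:
    explicit TaggedAfuModel(std::size_t tenants)
        : queues_(tenants), processed_(tenants, 0), next_(0) {}

    // Ingress: the tag selects the tenant's queue; unknown tags are dropped.
    bool enqueue(const Packet& p) {
        if (p.tenant_tag >= queues_.size()) return false;
        queues_[p.tenant_tag].push_back(p);
        return true;
    }

    // Round-robin: serve the next non-empty queue after the last one served,
    // so a tenant with a long backlog cannot starve the others.
    bool schedule(Packet& out) {
        for (std::size_t i = 0; i < queues_.size(); ++i) {
            std::size_t t = (next_ + i) % queues_.size();
            if (!queues_[t].empty()) {
                out = queues_[t].front();
                queues_[t].pop_front();
                ++processed_[t];  // state is indexed by tag, never shared
                next_ = (t + 1) % queues_.size();
                return true;
            }
        }
        return false;  // all queues are empty
    }

    std::uint64_t processed(std::uint32_t tenant) const {
        return processed_.at(tenant);
    }

private:
    std::vector<std::deque<Packet>> queues_;
    std::vector<std::uint64_t> processed_;
    std::size_t next_;
};
```

A weighted or rate-limiting policy could replace the round-robin loop without touching the tagging logic, which is the design point the sketch is meant to convey.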

Implementation

We implemented NICA’s hardware components and our example AFUs in C++, converting them to hardware using the Xilinx Vivado high-level synthesis (HLS) tool. To simplify hardware development, we have developed a library of reusable packet-processing components and a methodology for developing high-performance packet-processing applications in HLS.

Our library, ntl (Networking Template Library), provides common components for data-flow processing, and allows customizing components using higher-order functions.

Category	Classes
Header processing elements	pop/push_header, push_suffix
Data-structures	array, hash_table
Scheduler	scheduler
Basic elements	map, scan, fold, dup, zip, link
Specialized stream wrappers	pack_stream, pfifo, stream<Tag>
Control-plane	gateway

An example UDP parser using the functional components of ntl:

Fold & scan usage: parser example
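Since the parser figure is not reproduced here, the following is a rough software analogue of how fold and scan can combine into a UDP header parser. It does not use ntl's API: the `scan` and `fold` templates, `IndexedByte`, and `parse_udp` are hypothetical stand-ins that operate on `std::vector` rather than on hardware streams. `scan` tags each byte with its position, and `fold` accumulates the first eight bytes into the four big-endian 16-bit UDP header fields.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// scan: threads a running state through the input, emitting one state per element.
template <typename In, typename State, typename F>
std::vector<State> scan(const std::vector<In>& in, State init, F f) {
    std::vector<State> out;
    out.reserve(in.size());
    State s = init;
    for (const In& x : in) {
        s = f(s, x);
        out.push_back(s);
    }
    return out;
}

// fold: reduces the whole input to a single accumulated value.
template <typename In, typename Acc, typename F>
Acc fold(const std::vector<In>& in, Acc init, F f) {
    Acc acc = init;
    for (const In& x : in) acc = f(acc, x);
    return acc;
}

struct UdpHeader {
    std::uint16_t src_port = 0, dst_port = 0, length = 0, checksum = 0;
};

struct IndexedByte {
    std::size_t count;  // 1-based position of this byte in the datagram
    std::uint8_t byte;
};

UdpHeader parse_udp(const std::vector<std::uint8_t>& bytes) {
    // scan: attach a running byte count to every byte.
    std::vector<IndexedByte> indexed = scan(
        bytes, IndexedByte{0, 0},
        [](IndexedByte s, std::uint8_t b) { return IndexedByte{s.count + 1, b}; });

    // fold: shift each of the first 8 bytes into its header field.
    return fold(indexed, UdpHeader{}, [](UdpHeader h, const IndexedByte& b) {
        if (b.count > 8) return h;  // payload bytes leave the header unchanged
        std::uint16_t* fields[4] = {&h.src_port, &h.dst_port, &h.length, &h.checksum};
        std::uint16_t& f = *fields[(b.count - 1) / 2];
        f = static_cast<std::uint16_t>((f << 8) | b.byte);
        return h;
    });
}
```

In a dataflow setting the same two steps would run element by element as data arrives; the vector-based version here only illustrates how the parser's logic is expressed through the functions passed to `scan` and `fold`.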

The development of ntl allows us to share code among AFUs and NICA’s infrastructure, while reaching line-rate in hardware.

Key-value store cache example

We developed a key-value store cache AFU to show the benefits of NICA, and evaluated it on a Mellanox Innova Flex card, which uses a bump-in-the-wire design.

We integrated the key-value store cache AFU with memcached and show that it improves performance in both bare-metal and virtualized scenarios.

Graph: Key-Value Store Cache - Bare Metal RO

Graph: Key-Value Store Cache - Virtualization