Exploration and Practice of Performance Tuning for Kata Containers 2.0
Li Ning, Software R&D engineer of China Mobile Cloud Competence Center
On December 5, 2020, the OpenAnolis community held its first in-person Cloud Native Infrastructures Meetup in Beijing. Technical experts from Alibaba Cloud, Ant Group, Intel, China Mobile, Red Hat and other companies discussed cloud native infrastructure technologies such as the kernel, containers, and virtualization. We analyzed relevant open source technologies and community progress, and shared our experience of putting them into production in enterprises.
At the meeting, the basic software product team from China Mobile Cloud presented the topic “Exploration and Practice of Performance Tuning for Kata Containers 2.0”. We introduced the relationship between China Mobile Cloud and Kata Containers, explained why China Mobile Cloud selected Kata Containers as its security container solution, and shared some performance tuning explorations and optimization ideas for Kata Containers.

Why do we need Kata Containers?
Background

We are currently building our FaaS platform product based on a serverless architecture, and we will soon develop our security container service platform product. Both products have strict requirements for container security. Developing a security container solution from scratch is time-consuming and costly, so we hope to build on existing open source security container technology to bring these China Mobile Cloud products into production.
Technology selection for security container solution
We have compared the current mainstream security container solutions with Kata Containers:

● Google’s gVisor designs and implements a security sandbox from scratch. It uses ptrace or KVM to intercept the system calls of the processes in the container and implements a user-mode kernel as the isolation layer to handle these syscalls. gVisor’s security is very good, but its performance is average, and it supports only a limited set of system calls, so it is not a general-purpose solution;
● AWS’ Firecracker is not designed in a cloud native way. Although it can integrate with Kubernetes, the integration is comparatively complex;
● Kata Containers builds on existing virtualization technology, using a lightweight KVM virtual machine both as the sandbox in which containers run and as the security isolation layer.
Kata itself is designed as a container runtime that is natively compatible with the Open Container Initiative (OCI) and CRI. Compared with gVisor and Firecracker, Kata is more cloud native and integrates more naturally with Kubernetes.
Kata officially supports four hypervisors now:
● QEMU is a mature and complex hypervisor.
● Firecracker is a lightweight hypervisor developed by AWS for serverless scenarios. It only supports limited virtual devices.
● Cloud-Hypervisor is another lightweight hypervisor designed by Intel for cloud native scenarios.
● ACRN is a hypervisor developed for edge scenarios.
At present, the hypervisors supported by Kata Containers cover a variety of scenarios, so we are evaluating Kata to meet the needs of some of our business scenarios.
Introduction of Kata Containers technology solution
Architecture research
Our architecture research is based on Kata v2.0. Compared with v1.0, Kata v2.0 removes some components along with their overhead, which makes the architecture clearer.
The core idea of Kata is to use a virtual machine as a security sandbox. One of the main challenges this raises is how to manage containers across the virtual machine boundary. Because a virtual machine sits between the host and the containers, the host cannot manage the containers directly. Container management is therefore carried out by the kata-agent running inside the virtual machine.
In terms of technical implementation, there are three core problems that need to be solved:

How does the shim process outside the virtual machine communicate with the agent process inside the virtual machine?
Kata 2.0 uses virtio-vsock (vsock) as the communication channel between the inside and outside of the virtual machine, which solves the communication problem between the shim and the agent.
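
To make the host-to-guest channel more concrete, here is a minimal sketch of opening a vsock stream connection from the host, using Python's standard socket module on Linux. It is not Kata's own shim code. The helper names parse_vsock_uri and connect_to_agent are hypothetical, the port value 1024 is an assumption about the agent's listening port, and the RPC protocol spoken on top of the connection is not shown.

```python
import socket
from typing import Tuple

# Assumption: the agent listens on this vsock port; adjust for your setup.
AGENT_VSOCK_PORT = 1024


def parse_vsock_uri(uri: str) -> Tuple[int, int]:
    """Split a 'vsock://<cid>:<port>' address into (cid, port)."""
    prefix = "vsock://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a vsock URI: {uri!r}")
    cid_text, sep, port_text = uri[len(prefix):].partition(":")
    if not sep or not cid_text or not port_text:
        raise ValueError(f"expected vsock://<cid>:<port>, got {uri!r}")
    return int(cid_text), int(port_text)


def connect_to_agent(cid: int, port: int = AGENT_VSOCK_PORT,
                     timeout: float = 5.0) -> socket.socket:
    """Open a stream connection from the host to a guest over vsock.

    The context ID (cid) identifies the guest VM. This only works on a
    Linux host with vsock support and a guest that is listening.
    """
    sock = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    sock.connect((cid, port))
    return sock


# Parsing works anywhere; connecting needs a running guest.
cid, port = parse_vsock_uri("vsock://3:1024")
print(cid, port)
```

Unlike a TCP connection, a vsock address carries no IP configuration: the guest is addressed by its context ID, so the channel works even before the container network is set up.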

How can the image/rootfs outside the virtual machine be accessed by the container inside the virtual machine to support container creation and running?
According to OCI spec, an OCI bundle is needed when creating a container. The OCI bundle is composed of config.json describing the configuration of the container and the rootfs of the container.
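
As an illustration of the two parts of an OCI bundle, the following sketch writes a minimal bundle to a temporary directory and checks that both parts are present. The helper names write_minimal_bundle and is_oci_bundle are hypothetical, and the config.json shown is a heavily reduced example, not a configuration produced by Kata or containerd.

```python
import json
import tempfile
from pathlib import Path


def write_minimal_bundle(bundle_dir: Path) -> Path:
    """Create <bundle>/config.json and an empty <bundle>/rootfs directory."""
    rootfs = bundle_dir / "rootfs"
    rootfs.mkdir(parents=True, exist_ok=True)
    config = {
        "ociVersion": "1.0.2",
        "process": {
            "args": ["/bin/sh"],
            "cwd": "/",
            "user": {"uid": 0, "gid": 0},
        },
        # root.path is resolved relative to the bundle directory.
        "root": {"path": "rootfs", "readonly": False},
    }
    (bundle_dir / "config.json").write_text(json.dumps(config, indent=2))
    return bundle_dir


def is_oci_bundle(bundle_dir: Path) -> bool:
    """Check that config.json exists and that the rootfs it names is a directory."""
    config_path = bundle_dir / "config.json"
    if not config_path.is_file():
        return False
    config = json.loads(config_path.read_text())
    root_path = config.get("root", {}).get("path")
    if not root_path:
        return False
    return (bundle_dir / root_path).is_dir()


with tempfile.TemporaryDirectory() as tmp:
    bundle = write_minimal_bundle(Path(tmp) / "bundle")
    print(is_oci_bundle(bundle))
```

The config.json part is small and can be sent as a message; the rootfs part is a directory tree on the host, which is why it needs a separate mechanism to become visible inside the virtual machine.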
The config can be passed to the agent in the VM through gRPC calls over vsock. The rootfs required by the container, however, is on the host. How can the container image/rootfs be accessed by the agent in the VM? Kata solves this problem mainly in two ways:

● Shared-fs: mount the host directory into the virtual machine through a shared file system. Kata supports virtio-9p and virtio-fs, which are two ways of sharing a file system between guest and host. In 2.0, virtio-fs is adopted by default, and its performance is much better than that of virtio-9p;
● Block device passthrough: the device-mapper-based image/rootfs block device on the host is passed through to the VM by virtio-scsi or virtio-blk (with the corresponding backend). All IO on the virtio-blk/virtio-scsi rootfs device in the VM is handled by the virtio backend on the host and is finally written to the rootfs block device on the host.
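
The choice between the two paths can be pictured as a check on what backs the rootfs on the host. The sketch below is a simplified illustration and not Kata's actual logic: it assumes that a rootfs backed by a block device is a candidate for passthrough and that anything else is shared through the shared file system. The function name rootfs_strategy and the returned labels are hypothetical.

```python
import os
import stat


def rootfs_strategy(source_path: str) -> str:
    """Pick a way to expose the rootfs source on the host to the guest.

    Returns 'block-passthrough' when source_path is a block device
    (for example a device mapper volume), otherwise 'shared-fs'.
    """
    mode = os.stat(source_path).st_mode
    if stat.S_ISBLK(mode):
        return "block-passthrough"
    return "shared-fs"


# An ordinary directory is not a block device, so it would be shared.
print(rootfs_strategy("."))
```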
How does the virtual machine network interface with the existing container network (such as CNI)?

A CNI plugin typically places one end of a veth pair in the container’s network namespace, while the virtual machine’s network card is backed by a tap device on the host, so the two have to be connected. At present, there are two ways for Kata to connect veth and tap: tcfilter and macvtap.
● Tcfilter uses TC rules to connect the ingress of veth with the egress of tap, and the egress of veth with the ingress of tap. This chains the sending and receiving sides of veth with those of tap, so that the two devices together behave like a single network card;
● Macvtap is a mature technology in virtual machine network virtualization. With macvtap, when a packet arrives on veth, its destination MAC address is checked, and if it matches, the packet is passed directly to the tap device.
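
The tcfilter approach can be sketched as a pair of mirrored redirect rules. The example below only builds the equivalent tc command lines as lists of arguments and does not execute them. It illustrates the technique rather than the way Kata programs the rules internally, and the device names eth0 and tap0_kata as well as the function name tc_redirect_commands are placeholders.

```python
from typing import List


def tc_redirect_commands(veth: str, tap: str) -> List[List[str]]:
    """Build tc commands that redirect each device's ingress to the other's egress."""
    commands: List[List[str]] = []
    for src, dst in ((veth, tap), (tap, veth)):
        # Attach an ingress qdisc so that filters can be added on ingress.
        commands.append(
            ["tc", "qdisc", "add", "dev", src, "handle", "ffff:", "ingress"]
        )
        # Match every packet and redirect it to the egress of the peer device.
        commands.append([
            "tc", "filter", "add", "dev", src, "parent", "ffff:",
            "protocol", "all", "u32", "match", "u8", "0", "0",
            "action", "mirred", "egress", "redirect", "dev", dst,
        ])
    return commands


for command in tc_redirect_commands("eth0", "tap0_kata"):
    print(" ".join(command))
```

Running the printed commands requires root privileges and must happen inside the network namespace that holds both devices.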
Optimization ideas
According to the working principle of Kata’s network model, here are some optimization ideas that we have:
● We can drop the veth pair and bridge the tap device of the Kata VM directly to the cni0 bridge;
● We can build on DPDK. The cni0 bridge would be implemented as a high-performance vswitch based on DPDK. Using the vhost-user communication mechanism, the VM network card is connected to the DPDK vswitch, giving a high-performance user-mode network path from the VM network card to the cni0 bridge.
Both directions are relatively mature solutions in existing virtual machine networking. The key challenge is how to redesign the CNI plugin.
Summary of technical points
Let’s quickly summarize the technical points of the implementation of Kata Containers:
● VM as a safe sandbox
● Support a variety of virtualization solutions, such as QEMU, Firecracker, Cloud Hypervisor, etc.
● Agent in VM is responsible for creating, updating and destroying containers directly
● Use vsock as the communication channel between shim V2 process and agent
● Use tc rules or macvtap to link veth and tap, connecting the CNI network with the VM network
● Mount the host image/rootfs into the VM through virtio-9p or virtio-fs
● Use a block device on the host as the container rootfs through passthrough
So far, we have shared our thoughts on choosing Kata Containers as a security container solution, along with an analysis and summary of Kata Containers’ technical details and our own optimization ideas.
Outlook
Kata Containers brings us virtual machine level security and isolation, as well as container level startup speed. At present, we believe that Kata Containers is becoming more lightweight. In Kata 2.0, some components are rewritten in Rust, which lowers overhead. At the same time, Intel has developed Cloud Hypervisor, a lightweight hypervisor for cloud native scenarios based on rust-vmm. The cloud native design concept of Cloud Hypervisor matches Kata perfectly. Kata has added support for Cloud Hypervisor, and we expect Cloud Hypervisor to become the main hypervisor of Kata in the future. With the combination of the two, we believe Kata Containers will be the main solution for cloud native security containers in the future.
We finished our evaluation of Kata Containers before the event ended, and we are now making plans to use Kata in production in 2021.