The Two Years of Kata Containers

This is the first post in my series of blog posts after the Shanghai Summit:
- The Two Years of Kata Containers
- Kata Containers: Virtualization for Cloud-Native
- The Blueprint of Kata 2.0
We launched the Kata Containers project in December 2017. After that, a series of community actions, including the open-sourcing of gVisor and Firecracker, showed that we did push sandboxing technologies in the cloud-native world forward. It is our great honor to see more and more users adopting secure containers and contributing to the open-source communities.
Here I will summarize the technical achievements of the past two years and the pain points that still exist. In the following posts, I will share thoughts on future development gathered from the PTG attendees and other community members.
Achievement: Seamless integration with upstream Kubernetes
At the beginning, integration with the Kubernetes ecosystem was one of the top-priority goals of Kata Containers. Collaborating with the related communities, we kept working on guaranteeing and improving compatibility with Kubernetes. Here are some examples:
First of all, we introduced many Kubernetes-related integration tests, such as node-e2e conformance tests, cri-containerd tests, k8s tests for HTTP and TCP liveness probes, and heptio/sonobuoy e2e conformance tests. All these tests help us avoid breaking compatibility with Kubernetes in each change.
The Kata Containers developers have put much effort into Kubernetes compatibility, of which networking is the most remarkable example. We implemented bridge and macvtap modes for integration with existing Kubernetes networking schemes. In release 1.6, we introduced tcfilter as the new default networking mode, through which we can connect to almost all Kubernetes networks out of the box. In addition, we introduced an enlightened network mode for more efficient Kata Containers networking.
Aside from improving Kata Containers itself, work done in other communities helps our Kubernetes compatibility as well.
The most important one is RuntimeClass, introduced by Kubernetes SIG-Node, with which Kubelet users may specify the runtime (e.g. Kata or runC) they want for each Pod, along with a configuration profile for that runtime. Moreover, we added enhancements to RuntimeClass (for example, accounting for the overhead of the Pod sandbox), through which we help users and the scheduler make better use of Kata Containers.
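As an illustration, a RuntimeClass and a Pod selecting it look roughly like the following sketch. The names, overhead numbers, and image are hypothetical, and the exact apiVersion depends on your Kubernetes release:

```yaml
# Illustrative only: names and overhead values are hypothetical.
apiVersion: node.k8s.io/v1beta1
kind: RuntimeClass
metadata:
  name: kata
handler: kata          # which runtime handler the CRI runtime should invoke
overhead:
  podFixed:            # per-Pod sandbox overhead the scheduler accounts for
    memory: "160Mi"
    cpu: "250m"
---
apiVersion: v1
kind: Pod
metadata:
  name: kata-pod
spec:
  runtimeClassName: kata   # run this Pod inside a Kata sandbox
  containers:
  - name: app
    image: nginx
```

With the `overhead` field populated, the scheduler reserves the sandbox's fixed cost in addition to the containers' requests, so Kata Pods no longer overcommit nodes.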
In short, the ecosystem has become more secure-container-friendly, and the credit goes to all the secure container developers and the developers from related communities.
Achievement: Fewer indirections in the stack
One of my favorite quotes is David Wheeler's:
"All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections."
For seamless integration with the Kubernetes ecosystem, we introduced many indirections, which are not only inelegant but also harmful to operability in production clusters. We have been working on removing the unnecessary ones.
The first component removed was kata-proxy. Having adopted vsock as the default communication channel to the agent, kata-proxy disappears from our architecture entirely, and the mux-demux in the agent becomes unnecessary.
A more important architectural change is the introduction of shim-v2, which was first proposed in the containerd community by Kubernetes SIG-Node, implemented by containerd and Kata Containers in 2018, and later supported by CRI-O as well. The idea behind the interface is to no longer treat a container as a group of processes, and to interact with containers through an RPC protocol instead of POSIX signals. By implementing the shim-v2 interface with a per-Pod "shim", the number of Kata Containers auxiliary processes is reduced from 2N+2 to 1 for an N-container Pod. The following figure shows the change.
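To make the 2N+2 figure concrete, here is a small accounting sketch in Python. The breakdown it assumes (two helper processes per container plus two per Pod) is illustrative; the exact set of helpers varied between stack versions:

```python
def legacy_helper_processes(n_containers: int) -> int:
    """Pre-shim-v2 helper process count for an N-container Pod.

    Assumed breakdown (illustrative): two shim-style helpers per
    container, plus two per-Pod helpers such as kata-proxy --
    giving the 2N + 2 figure from the text.
    """
    per_container = 2 * n_containers  # e.g. containerd-shim + kata-shim per container
    per_pod = 2                       # e.g. kata-proxy + one more per-Pod helper
    return per_container + per_pod


def shim_v2_helper_processes(n_containers: int) -> int:
    """With shim-v2, a single per-Pod shim serves every container over RPC."""
    return 1


# A 3-container Pod drops from 8 helper processes to 1.
print(legacy_helper_processes(3), shim_v2_helper_processes(3))
```

The win grows linearly with Pod size, which is why the per-Pod shim matters for dense nodes.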
Achievement: Reducing the overhead
The isolation of the Pod sandbox is useful in most cases, but many users don't want to pay the overhead for it. The slogan of the project has been "the security of VMs, the speed of containers" from the beginning, and we have put great effort into reducing the overhead over the past years.
From the beginning, two technologies were adopted to accelerate booting and reduce memory consumption: VM templating and DAX. Both technologies aim to share the memory of binaries among Pods.
VM templating comes from runV by hyper.sh. In brief, the template is a paused, empty VM containing a booted kernel and a started kata-agent. With the template in place, we perform a "live migration" from it without any memory copy, since both VMs are on the same machine. After that, we can resume the cloned Pod sandbox and launch containers inside it.
DAX was originally developed for non-volatile memory devices and was introduced to Clear Containers by Intel developers. The idea behind DAX is that file contents do not need to be copied into the page cache if the backing storage is itself (non-volatile) memory. In our case, we tell the in-sandbox kernel to access the device with DAX instead of copying the files into guest memory. Then all the static files shared among Pods can share the page cache on the host side.
In the past two years, the template-related patches have been merged into upstream Qemu, and DAX has been used more widely.
Meanwhile, we have never stopped our efforts to reduce the VMM overhead. Qemu-lite and NEMU were introduced to reduce the Qemu overhead, and some of the work from these two projects has been merged back into upstream Qemu.
In late 2018, AWS announced the lightweight VMM Firecracker, and we supported it within just a couple of weeks. After that, we helped drive the rust-vmm project, and will adopt a rust-vmm-based VMM, Cloud Hypervisor, as another VMM for Kata, one that is lightweight, secure, and easy to customize for Kata Containers.
And in 2019, we introduced a new in-sandbox agent written in Rust, which was merged just one week before the Summit and will ship in release 1.10. Tests on the initial version show that it reduces the anonymous pages, which cannot be saved by DAX or templating, from 11MB to 1.1MB.
There is still ongoing work, and our mission is to develop an (almost) zero-overhead sandboxing technology.
Achievement: The Beginning of Virtualization for Cloud-Native
Different from the traditional VM world, containers are application-centric, which implies different requirements on virtualization. The vsock mentioned above is one example: it was insignificant in earlier elastic computing services, but in Kata Containers it is one of the most important features, because there is much more communication between host and guest.
virtio-fs is even better evidence. Unlike VirtFS (9p), which is a network filesystem in nature, virtio-fs is designed for local filesystem sharing, which is widely employed in the container world. virtio-fs adopts FUSE, vhost-user, and DAX technologies. As a result, it has much better POSIX compliance and performance than 9p. Moreover, for shared files, which are very common in container images, the DAX support in virtio-fs can share the page cache among containers, as mentioned above. Another strength is that the virtio-fs backend is a user-space daemon on the host, which makes it easy to add sophisticated optimization logic.
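For reference, switching a Kata host from 9p to virtio-fs happens in the hypervisor section of the runtime configuration. A sketch follows; the daemon path and cache settings are illustrative, and the exact option names may vary across releases, so check the configuration shipped with your installed version:

```toml
# Illustrative fragment of a Kata configuration.toml.
[hypervisor.qemu]
shared_fs = "virtio-fs"                  # use virtio-fs instead of virtio-9p
virtio_fs_daemon = "/usr/bin/virtiofsd"  # host-side user-space daemon (virtiofsd)
virtio_fs_cache = "always"               # guest page-cache policy
virtio_fs_cache_size = 1024              # DAX window size in MiB
```

The `virtio_fs_cache_size` window is what enables the host-side page-cache sharing described above.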
Aside from virtio-vsock and virtio-fs, the memory-scaling technology virtio-mem is under development and will be introduced to Kata Containers in the future, making the virtualization here even more container-like.
Drawback: Not end-to-end isolated yet
Though we have made a bunch of achievements in the past two years, there is still much work to do. The most important task is to strengthen the isolation further.
As mentioned before, Kata Containers needs to be more content-aware than traditional virtualization. However, content-awareness may imply weaker isolation, so we need to be careful to isolate the sandboxes from each other well. In a forum at the Shanghai Summit, we discussed the threat model:
First of all, containerd or CRI-O prepares the pipes and storage on the host and then passes them to the sandbox, which means the user image and the I/O stream processing are not isolated yet; this is harmful to both security and QoS.
In the bottom-right part of the figure, vhost-vsock (for Qemu) and macvtap are located in the host kernel. In theory, a user could craft packets to attack the host kernel here. At this point, the user-space vsock implementation in Firecracker is thought to be better.
The attendees of the forum concluded that the host kernel should be protected most carefully, followed by the VMM and the guest kernel. On the other hand, the agent and the guest kernel should be considered the most dangerous components in the model.
Drawback: Operability and debuggability
Another trade-off between isolation and content-awareness concerns operability and debuggability. Compared to traditional containers, we hardened the isolation, which makes it hard for users to run system tools to debug their applications and makes metrics collection harder.
At the Shanghai PTG, the developers agreed that we should make more improvements on operability and debuggability, such as improving the event interfaces.
Summary
Looking back at the past two years, we have improved the isolation in the container world at the expense of some overhead. We believe the container world needs a better isolation solution that does not hurt the application-centric nature of Cloud-Native. Our vision for the Kata Containers project is to isolate Cloud-Native applications transparently with sandboxing technologies at a minimal cost. I will explain our thoughts on future developments in the following posts.