Hacker News
Kata Containers: Virtual Machines that feel and perform like containers (katacontainers.io)
146 points by flanked-evergl on July 17, 2023 | hide | past | favorite | 83 comments


Kata used “Linux kernel Direct Access filesystem (DAX)” to directly share access of the host filesystem to the guest kernel. I thought this was pretty interesting, but it sounds like a possible spot to start a jailbreak. I’m guessing these kinds of optimizations along with using super simple virtualized devices is what gives Kata its almost-cgroups-like performance.

> Mapping files using DAX provides a number of benefits over more traditional VM file and device mapping mechanisms:

> Mapping as a direct access device allows the guest to directly access the host memory pages (such as via Execute In Place (XIP)), bypassing the guest kernel's page cache. This zero copy provides both time and space optimizations.

> Mapping as a direct access device inside the VM allows pages from the host to be demand loaded using page faults, rather than having to make requests via a virtualized device (causing expensive VM exits/hypercalls), thus providing a speed optimization.

> Utilizing mmap(2)'s MAP_SHARED shared memory option on the host allows the host to efficiently share pages.

From https://github.com/kata-containers/kata-containers/tree/main...


Yeah, it's worth understanding the attack surface of DAX - if someone has information I'd be very interested. That said, you could mitigate it in other ways depending on your use case.

Having gone through an evaluation of Firecracker's security my main conclusion was that sandboxing the processes in the guest is the highest 'bang for your buck' way to reduce escapes.


Would it be simple enough to stick a union file system over the top of the host file system?


Do you mean inside the guest? I'm sure they do stuff like that; they operate the normal sort of cgroups + fs mappings inside the guest VM to create inner containers. Eg all the containers in a Kubernetes pod run inside the same VM, but different containers within the VM.


virtiofs supports DAX.
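For the curious, a rough sketch of wiring up a virtiofs share with a DAX window (paths, sizes, and directory names are illustrative; the `cache-size` DAX window is still experimental and needs support compiled into both QEMU and the guest kernel):

```shell
# Host: export a directory over vhost-user (virtiofsd)
virtiofsd --socket-path=/tmp/vhostqemu --shared-dir=/srv/share &

# Host: attach it to the VM; cache-size reserves the DAX window the
# guest maps host pages into (requires a QEMU build with virtiofs DAX)
qemu-system-x86_64 -enable-kvm -m 4G \
  -object memory-backend-memfd,id=mem,size=4G,share=on \
  -numa node,memdev=mem \
  -chardev socket,id=char0,path=/tmp/vhostqemu \
  -device vhost-user-fs-pci,chardev=char0,tag=myfs,cache-size=2G \
  -drive file=guest.img,if=virtio

# Guest: mount with DAX so file pages come straight from host memory
mount -t virtiofs myfs /mnt -o dax
```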



Vagrant also kind of fits the description, but I've been looking for an alternative due to its libvirt support being iffy (the plugin often breaks or has issues compared to the VirtualBox provider, and there aren't many official image providers).

Here's one option I came across recently that's pretty nice and simple, using virsh and cloud images, and making it easy to spin up a VM and ssh into it. [1]

One use case I'm looking for is to provision system images including ZFS pools/dataset, and not sharing the host kernel module.

[1] https://earlruby.org/2023/02/quickly-create-guest-vms-using-...
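In case it helps others, the rough shape of that workflow (image URL, user name, and SSH key are placeholders) looks like:

```shell
# Fetch a distro cloud image (Ubuntu shown; others work similarly)
wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img

# Minimal cloud-init user-data with an SSH key
cat > user-data <<'EOF'
#cloud-config
users:
  - name: dev
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... dev@example
    sudo: ALL=(ALL) NOPASSWD:ALL
EOF

# Build a NoCloud seed image (from the cloud-image-utils package)
cloud-localds seed.img user-data

# Import both disks as a VM and boot it
virt-install --name testvm --memory 2048 --vcpus 2 \
  --disk jammy-server-cloudimg-amd64.img \
  --disk seed.img,device=cdrom \
  --import --os-variant ubuntu22.04 --network default --noautoconsole

virsh domifaddr testvm   # find the IP, then: ssh dev@<ip>
```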


I love the vagrant UX, I’d be thrilled to see a refresh for 2023. Box management is odd. Weaveworks ignite/footloose are also good but don’t quite scratch my itch. I think vagrant UX with ignite style kernel/rootfs as oci image is a nice flow.


> Weaveworks ignite/footloose are also good but don’t quite scratch my itch.

Are these still going? They seem a bit dead to me.


That’s my take too. At least for ignite, I think Lucas had to take time off for college. Not sure about footloose. I guess another reason I suggested a refresh ;)

Firecracker and firectl themselves are maintained of course, but lower level.


I also use cloud-init; one can spin up VMs with it quickly (~5 seconds). I use virt-install, which has command-line options for the cloud-init related files, making everything very easy.
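For reference, the relevant flags (virt-install 3.0+; file names here are placeholders) look something like:

```shell
# virt-install builds the NoCloud seed ISO for you from --cloud-init,
# so no separate cloud-localds step is needed
virt-install --name civm --memory 2048 --vcpus 2 \
  --disk size=10,backing_store=/var/lib/libvirt/images/jammy-server-cloudimg-amd64.img \
  --cloud-init user-data=./user-data.yaml \
  --import --os-variant ubuntu22.04 --network default --noautoconsole
```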


I'm very interested in the concept, but found Kata to be hard to get going with.

Last time I looked (a few months ago), the documentation was pretty sparse or outdated. A lot of documentation I found stated something like "we broke that in the new version, you can't actually do that right now", like using it with Docker, which I would much prefer over setting up Kubernetes.

I still think it's an absolutely great thing, but the onboarding could have been a lot less rocky.

FWIW, I eventually kind of DIYed it with QEMU MicroVM and virtiofs - never did anything with it though.


On the onboarding point: almost all projects would do themselves a big service by putting work into the onboarding experience, because it's a key factor in how many people will use and help develop the project.


> Last time I looked (a few months ago), the documentation was pretty sparse or outdated.

It still is, though it works somewhat seamlessly when installing with https://github.com/kata-containers/kata-containers/blob/main...

Though only one of the hypervisors works well.


> Though only one of the hypervisors works well.

Well don't leave us hanging, which one?

(My money's on QEMU)


Sorry, I should have said, but yes it is QEMU which is the default.


I don't understand how they can be as fast as regular containers if they run an entire kernel on top of a hypervisor?


I don't have numbers either, but it's a combination of extreme focus on the boot path and virtio drivers, and traditional containers now being quite heavyweight to start (especially when run via Kubernetes).

The big problem with Katacontainers is not whether or not they are slightly faster or slower than containers, but the fixed memory allocation which means you must first know and then allocate the maximum amount of memory they might ever need up front. This can practically limit the number of Katacontainers you can run to something much smaller than is possible with ordinary containers, since RAM is the constrained resource on most servers.

Nevertheless, with confidential computing coming along, it's likely that at some point in the future many containers will really be VMs, since current CPUs implement confidential computing on top of existing VM primitives (and that's basically necessary due to the way the guest RAM is encrypted). It's likely that any workload that touches PII, finance, health, etc will be required to use confidential computing.


> but the fixed memory allocation which means you must first know and then allocate the maximum amount of memory they might ever need up front.

Yup, that's always been the big reason to use containers for me. Startup time and runtime performance are nice benefits, but the memory usage is the giant win: freeing memory in response to the app's needs, and not needing extra memory for running the various OS parts and pieces.

The downside is, of course, security. But that was always the case with containers.


I think in the longer run, WASM might displace a lot of both in practical terms.


Why would wasm replace containers? If you're going to run a binary why not just compile it for the local system?

We've always had 'compile once, run anywhere' but there's always been caveats and gotchas.


Something still compiles a WASM binary for the local system. Possibly, being able to optimize the WASM without recompiling it from source might be a win? Not needing separate binaries for ARM and x86 is nice, so it should run on a Mac more easily. Also, it runs on an edge server or in a browser, even on a phone, if you care about that.

I don’t think it will replace Docker files since they let you package up such a wide variety of existing server software and WASM is more limited. But if your software does compile to WASM then maybe you don’t care about that.

I think of WASM more like a plugin format, but I expect there will be a lot of engineering effort put into optimizing it, like happened with V8 for JavaScript. Not all web standards win, but betting against one that’s well-established and has a lot of support seems like a mistake.
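As a concrete sketch of the "one artifact, many hosts" point (the crate name and wasmtime as the runtime are assumptions):

```shell
# Build the same source for two targets; only the native one is arch-specific
rustup target add wasm32-wasi
cargo build --release                       # native binary for this machine
cargo build --release --target wasm32-wasi  # portable .wasm module

# Run the portable module under a WASM runtime (wasmtime here); the same
# .wasm file runs unchanged on x86, ARM, edge runtimes, or a browser shim
wasmtime run target/wasm32-wasi/release/myapp.wasm
```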


WASM targets the WASM runtime virtual machine (i.e. a JavaScript-engine-style VM), offering fine-grained isolation compared to virtualizing a whole operating system.

edit: don't shoot the messenger. I was merely highlighting the main difference between native and webasm in the context of the discussion.


You could say the same thing for anything that targets JVM or CLR, and they're far more mature than any JS runtime.


Or even Lua, which is trivial to sandbox.


For serverless (as in AWS-Lambda-like) I agree. In that use case WASM provides a better security barrier than containers, with faster cold-start time (which is really important for the scaling promise of these services).

For the stuff people run on their Kubernetes clusters I have more mixed expectations. Containers are more universal, but I can totally see a microservice architecture running as a lot of WASM runtimes with a handful of containers.


Particularly in the case of Amazon Lambdas, those are running in VMs already (Firecracker). Why wouldn't you skip the WASM layer and use a static binary from Rust instead?


> The big problem with Katacontainers is not whether or not they are slightly faster or slower than containers, but the fixed memory allocation which means you must first know and then allocate the maximum amount of memory they might ever need up front.

Conversely the problem with containers is that memory allocation including the OS page cache is not guaranteed. That's bad for a lot of applications, especially databases. It seems Docker has some support for shared page cache but it's not in the Kubernetes pod spec as far as I can see. [0] You would probably need some kind of annotations and a specialized controller to make this work.

[0] https://github.com/kubernetes/kubernetes/issues/43916


In Kubernetes 1.25+ page cache usage accounting is improved thanks to use of cgroupsv2. https://kubernetes.io/docs/concepts/architecture/cgroups/
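You can see that accounting directly on a cgroup v2 host; the paths under /sys/fs/cgroup vary per pod, but the idea (illustrative, root cgroup shown) is:

```shell
# "file" is page cache charged to this cgroup, "anon" is anonymous memory;
# cgroup v1 lumped these together much less usefully
grep -E '^(anon|file) ' /sys/fs/cgroup/memory.stat
```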


Kata containers support memory ballooning like most modern VMs: https://en.wikipedia.org/wiki/Memory_ballooning so a fixed allocation isn't needed, reducing over-provisioning.

https://github.com/kata-containers/kata-containers/blob/d50f... uses virtio-mem


This isn't a substitute (nor is virtio-mem, the modern equivalent). The problem is the application running in userspace inside the guest cannot request more memory when, for example, it does a mmap or sbrk.


Which application languages and frameworks support this kind of dynamic memory allocation? For predictability in performance and throughput reasons we benchmark our Java applications on specific cpu and memory constraints and specific heap and memory settings. How would an app in a container suddenly give back ram? A garbage collected application may be able to do that by collecting garbage. Possibly. But others?


No idea about Java, but any C program will request memory using mmap, and may give it back using munmap. This doesn't work when the program is running inside a VM, but does work for containers (which are basically just regular processes).


How does this work in practice? If the application didn’t need this memory anymore because it is done with the work/data, shouldn’t it already have freed or munmapped it? Is there a signal that can be send to a process to free up and return memory?


With VM ballooning, VMs are also able to claim and release memory to the hypervisor/host OS.


Not driven by the guest application they don't. It's frustrating that I actually work in this area and know what I'm talking about. I worked on Katacontainers back when it was Intel's Clear Containers in the late 2010s. So many people in this thread do not have a clue.


Interesting, that's a very annoying constraint then


I couldn't understand the comment that "the application running in user space cannot request more memory" - can someone explain what the point of memory ballooning is anywhere, if an application cannot signal when the system should actually provision physical memory from the 'balloon'?


There isn't a point, that's the problem.


The systems administrator can use ballooning to give more memory to a VM before launching a new application. This avoids the need to shut down the VM to give it a new role.

There is still a benefit to ballooning support even if it's not exposed to userspace within the VM, because VMs aren't always used purely to host a single infinitely-long-lived application without outside intervention.
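With libvirt, for example, that host-driven resize is a one-liner (the VM name and size are placeholders):

```shell
# Balloon the running guest down/up, within the maximum memory
# it was booted with; no guest reboot needed
virsh setmem myvm 2G --live

# Inspect balloon target and actual guest usage stats
virsh dommemstat myvm
```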


Linux supports hot-attaching RAM, or the VMM could support memory ballooning. VMs don't necessarily all need to be fully backed by physical RAM.


Slightly off topic, but regarding the larger memory footprint of Kata containers, what is your opinion on KSM effectiveness in general for VMs?


We had a lot of reports of ksm/ksmtuned consuming a lot of CPU and not making a lot of difference. I think it works well for certain workloads, and can be quite pessimal for others. There are also security concerns because you can leak information about (eg) what glibc is being used by another tenant using timing attacks. So you'd probably want to turn it off if multiple tenants can be using a single node.
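For anyone who wants to measure this themselves, KSM is driven through sysfs (root required; note it only merges regions the application - e.g. QEMU - has madvise()d as MERGEABLE):

```shell
echo 1 > /sys/kernel/mm/ksm/run          # enable scanning
cat /sys/kernel/mm/ksm/pages_sharing     # pages deduplicated away
cat /sys/kernel/mm/ksm/pages_shared      # unique pages backing them
cat /sys/kernel/mm/ksm/full_scans        # scan passes completed so far
echo 2 > /sys/kernel/mm/ksm/run          # disable and unmerge everything
```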


There's probably a big asterisk there. The correct term is probably "fast enough".

virtualization adds very overhead, a Windows VM running with a dedicated GPU can get 95% of the host's score on 3dmark.

The biggest issue in these cases is IO, which can be handled in a few ways.


> virtualization adds very overhead

Missing word?

"Very little"? 5% is enough to turn this year's high-end machine into last year's model.

Nested virtualisation is also a thing now, and 5% per layer adds up fast.


The short version is that kernels support guest/host relationships natively so that guests can pass operations directly to the host without having to go through an additional system call. Everywhere you do this is attack surface where an attacker in the guest can communicate with privileged facilities, so you want to minimize this where you can.

There's usually overhead in the places where the communication requires an additional hop. If you want your host filesystem isolated you're going to need a translation layer and it will be slower. If you're willing to open up your host OS's filesystem, you can basically get ~0 overhead.


There is pretty low overhead if you are opinionated - this is very similar to Firecracker (AWS) tooling: a cut-down hypervisor with ~0 devices, plus a cut-down guest OS, means pretty quick boot times.


Yeah I'd like to see some numbers on that, like startup time.


Depending on your use case there's potentially negligible startup time - on the scale of single-digit seconds down to less than half a second, depending on how much work you put into optimizing it. For some applications this will be too slow (mainly the type where you boot a container per request, although Fly.io seems to make it work), but I think for a _lot_ of applications this wouldn't be noticed.

Kata gives you a few different options for what/how you'd like to boot including firecracker.

This isn't exclusive to Firecracker, but if you stay lightweight you can have VMs booting in under half a second if you're using slim images.

https://jvns.ca/blog/2021/01/23/firecracker--start-a-vm-in-l...

I honestly think that for a lot of people, VMs with the convenience/orchestration tooling of containers make more sense for many general use cases simply because of the security benefits. The convenience still needs some work though.


Unless you're dealing with a multi-tenant situation I'm not super convinced that a VM is worth the effort. It's not the perf, it's the work of building your kernel, root filesystem, and the other infra needed to make it all work.

Compare that to a Docker container, where there's basically zero additional work to be done to be up and running.

For most cases I'd be really tempted to work on hardening the Docker container rather than setting up a VM. Things like AppArmor and seccomp in particular would likely go a very long way.
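A hardened `docker run` along those lines might look like this (the seccomp profile file and AppArmor profile name are placeholders you'd author for your workload):

```shell
docker run --rm \
  --cap-drop ALL \                                  # no Linux capabilities
  --security-opt no-new-privileges \                # block setuid escalation
  --security-opt seccomp=./restricted-seccomp.json \ # allowlist of syscalls
  --security-opt apparmor=my-hardened-profile \     # MAC policy on files/net
  --read-only --pids-limit 100 \                    # immutable rootfs, no fork bombs
  myimage
```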


I wish that someone would put up a honeypot website running various containers holding a root user, where the only goal is to break out of the container.

I suspect that nearly all containerization software is insecure. Especially with timing attacks like side-channel attacks (Row Hammer, etc):

https://en.wikipedia.org/wiki/Timing_attack

https://en.wikipedia.org/wiki/Side-channel_attack

In the end, the only way to "prove" container security is to point to the fact that nobody has broken out of it yet. It's remarkable that our entire cloud infrastructure runs on containers that have never been audited by brute force in this manner.


Lots of people have tested these systems.

Here's work we did to exploit Firecracker. We had a known, promising vulnerability, and still failed to break out.

https://web.archive.org/web/20220927150915/https://www.grapl...

Exploitation is extremely hard. No one has ever exploited RowHammer in the wild, to my knowledge, and there are a lot of reasons why - but even still, RowHammer isn't magic, you can mitigate against it by increasing your refresh rate, or by limiting the attacker's execution time. Not to mention that vendors have deployed numerous methods of reducing the likelihood of an attack and it is quite complex to pull off these days.


I think transparent memory encryption effectively defeats RowHammer attacks.

The CPU has a key it uses to decrypt memory on access - and that key is never known by the operating system or any software running within it. If you use RowHammer to access the "wrong" memory location, the decrypted values would be random garbage.

Pure hardware protection, no changes to memory chips, almost no runtime performance cost, effective security.


AFAIK RowHammer is already a non-issue on the latest hardware, since the module will speculatively refresh rows that may be vulnerable, i.e. if a ton of writes happen to row B, then rows A and C will have a refresh triggered too.


> I suspect that nearly all containerization software is insecure.

This seems, well, naive?

You think there aren't a lot of people that have tried to break out of cloud containers?

Both EC2 and gcloud have had issues over the years with container breakout and leaks.

Complex (ie literally layers of operating systems) software has bugs. Yes. But so does the non-containerised base-case.

We have the best style of honeypots you could ever ask for already running- payment infrastructure on the internet. Go get 'em.



Apologies everyone, I've been a little too close to my work lately and jumping to conclusions. A normal person might have phrased this like:

Hey that's an interesting solution you have there, do you think that emulation might help guard against side-channel attacks? Do you have any plans to mitigate them? I wonder if anyone has gamified container security, maybe with a honeypot somewhere where people could try to break out.


Those are open source; you can run them yourself, no need for a "honeypot" hosted by someone else.

And I have no idea why you think no one is trying to break them or audit them.


That honeypot was called VPS hosting and over ten years it didn't get hacked much if at all.


Quantum hack?

Nah, it’s 2023, brute it


> Runs in a dedicated kernel, providing isolation of network, I/O and memory and can utilize hardware-enforced isolation with virtualization VT extensions.

So if it’s a dedicated kernel, can this fool game anti-cheat systems into thinking it’s not in a VM? Or is it still the same problem?


If a game anti-cheat system can detect that a regular VM is a VM, then it will also detect Kata VM is a VM. In both cases you're running a game in a VM with a dedicated kernel.

Kata VMs are especially VM-y because they use a lot of VM-only features that wouldn't work with real hardware to enhance performance by sharing work between the guest and the host.


Well it’s not that it “detects” it, but most game anti-cheat (Vanguard or BattlEye) installs at the ring0 layer (kernel). So on VMs, that doesn’t exist (from what I understand). But with this it should be enough to fool it into thinking it’s installed in the right space.


> So on VMs, [ring0 layer (the kernel)] doesn’t exist (from what I understand)

The kernel still exists in a VM. The “dedicated kernel” bit of Kata distinguishes it from typical containers, not from VMs.

Kata is an abstraction atop existing hypervisors that are themselves just an abstraction over KVM. There’s nothing new here w.r.t. VM detection evasion.

> fool it into thinking it’s installed in the right space

Most VM detection is about observing devices, drivers, timings, or other side-channel type data that is often only seen in a VM.


Most VM detection I've seen is done through fingerprinting. Checking what the hardware is, certain CPU features/flags, config files, etc.


Is it possible to add Kata containers in EKS or GKE?


Not in EKS. EC2 doesn't support nested virtualization today.


Peer pods are meant to solve this (one day) by running the "containers" (ie VMs) to the side as regular AWS instances and peering the communications:

https://www.redhat.com/en/blog/red-hat-openshift-sandboxed-c...


AWS already has an implementation of this, namely, Nitro Enclaves. A Nitro Enclave is an isolated VM used for confidential computing purposes. You provide the Enclave an EIF file, which is basically an OCI image with cryptographic measurements. The Nitro hypervisor launches the VM and allows communications with its parent VM only over a Linux vsock connection.

Also, AWS has Fargate, which is basically containers-as-VMs. Every ECS task or EKS Pod is launched as a separate EC2 instance under the hood. This obviates a lot of the need for a solution like Kata there.
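For completeness, the Nitro Enclaves workflow is driven by `nitro-cli` from the parent instance (image name and sizes here are placeholders):

```shell
# Build an EIF from a Docker image; the command prints the PCR measurements
nitro-cli build-enclave --docker-uri myimage:latest --output-file app.eif

# Launch it as an enclave carved out of the parent instance's resources
nitro-cli run-enclave --eif-path app.eif --cpu-count 2 --memory 512

# Shows the running enclave, including the CID used for vsock comms
nitro-cli describe-enclaves
```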


So why would you use this over regular containers? Is this for people who think containers are insecure?


Containers are not appropriate for some scenarios. If your threat model is "attacker has arbitrary code execution" you should be very wary of a container holding them for long - you're one kernel privesc away from a container escape. It's nice to have if your situation is "first the attacker would need to exploit my service and then they'd have code exec", but if you provide RCE as a service a normal container is just not a good barrier.

By contrast, running attacker code in Firecracker is much safer - for one thing, the attacker needs to escalate within the VM in order to then expose the necessary primitives to then escape the VM. So it immediately adds an additional required vuln. But also, instead of the entire Linux kernel being the attack surface (it is for the first vuln, but...) you have to attack a much smaller codebase that implements the VM.

In the case of Firecracker this is really hard and, depending on your exploit, if you end up controlling the Firecracker VM itself you're actually in yet another sandbox - so you now need another vuln to escape (although that sandbox is not amazing imo).


But containers run everywhere. ECS for example. If they were insecure then half the web services we use every day would get hacked.


ECS is not using containers for isolation though. A single ECS cluster will only ever run on single-tenant infrastructure.

When AWS requires isolation they never use containers.


Run multiple workloads on the same CPU that you don't fully trust. VM-based containers will force the CPU to properly context switch between guests (e.g. using VMENTER on x86), normal process-based containers won't.


Many serverless services use microVMs instead because they think containers don't have enough isolation (= insecure?) to run untrusted code.


Serious question - at this point why not just use actual VMs?


That's what this is. "Containers" implemented as actual VMs.


1. So you can run any container image that someone made into a Docker container instead of building your own VM images

2. So you can run in Kubernetes and use the Kubernetes API, alongside any non-VM Pods and many other features in your Kubernetes cluster

3. Because traditional hypervisors have a higher overhead cost than the microVMs used by Kata and Firecracker

4. Because traditional hypervisors have a longer startup time (e.g. Firecracker is used by Amazon to run Lambda functions)


The API for containers is widely used.


They are VMs, but without the annoying "pet" UX.


How does this compare to Singularity containers?


Should I run `docker run hello-world` or use a Kata Containers VM to output Hello World?


If you install kata with https://github.com/kata-containers/kata-containers/blob/main...

Then you just use it with:

docker run --runtime io.containerd.kata.v2 --rm -it hello-world
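A quick way to sanity-check the install and confirm the isolation (subcommand naming has shifted between Kata releases, so treat this as a sketch):

```shell
# Verify the host can run Kata (KVM etc.); older releases used
# `kata-runtime kata-check` instead
kata-runtime check

# Inside Kata, `uname -r` reports the guest kernel, which typically
# differs from the host's - an easy proof you're really in a VM
docker run --runtime io.containerd.kata.v2 --rm -it alpine uname -r
```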



