Kata used the Linux kernel's Direct Access filesystem (DAX) support to share the host filesystem directly with the guest kernel. I thought this was pretty interesting, but it sounds like a possible spot to start a jailbreak. I'm guessing these kinds of optimizations, along with using super simple virtualized devices, are what give Kata its almost-cgroups-like performance.
> Mapping files using DAX provides a number of benefits over more traditional VM file and device mapping mechanisms:
> Mapping as a direct access device allows the guest to directly access the host memory pages (such as via Execute In Place (XIP)), bypassing the guest kernel's page cache. This zero copy provides both time and space optimizations.
> Mapping as a direct access device inside the VM allows pages from the host to be demand loaded using page faults, rather than having to make requests via a virtualized device (causing expensive VM exits/hypercalls), thus providing a speed optimization.
> Utilizing mmap(2)'s MAP_SHARED shared memory option on the host allows the host to efficiently share pages.
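For context, Kata exposes knobs for this in its runtime configuration. A rough sketch of the relevant `configuration.toml` section (key names from memory; check the config your Kata version ships, since defaults have changed across releases):

```toml
[hypervisor.qemu]
# Share the rootfs/volumes with the guest via virtio-fs.
shared_fs = "virtio-fs"
virtio_fs_daemon = "/usr/libexec/virtiofsd"

# Size (MiB) of the DAX "cache": the shared-memory window through which the
# guest maps host page-cache pages directly. 0 disables DAX and falls back
# to ordinary virtio-fs request/reply traffic.
virtio_fs_cache_size = 1024

# virtio-fs caching mode ("none", "auto", "always").
virtio_fs_cache = "auto"
```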
Yeah, it's worth understanding the attack surface of DAX - if someone has information I'd be very interested. That said, you could mitigate it in other ways depending on your use case.
Having gone through an evaluation of Firecracker's security my main conclusion was that sandboxing the processes in the guest is the highest 'bang for your buck' way to reduce escapes.
Do you mean inside the guest? I'm sure they do stuff like that; they use the normal sort of cgroups + fs mappings inside the guest VM to create inner containers. E.g. all the containers in a Kubernetes pod run inside the same VM, but as different containers within that VM.
Vagrant also kind-of fits the description, but I've been looking for an alternative due to its libvirt support being iffy (the plugin breaks often or has issues compared to the VirtualBox provider, and there aren't many official image providers).
Here's one option I came across recently that's pretty nice and simple, using virsh and cloud images, and making it easy to spin up a VM and ssh into it. [1]
One use case I'm looking for is provisioning system images including ZFS pools/datasets, without sharing the host kernel module.
I love the Vagrant UX; I'd be thrilled to see a refresh for 2023. Box management is odd. Weaveworks Ignite/Footloose are also good but don't quite scratch my itch. I think the Vagrant UX with an Ignite-style kernel/rootfs as an OCI image would be a nice flow.
That's my take too. At least for Ignite, I think Lucas had to take time off for college. Not sure about Footloose. I guess another reason I suggested a refresh ;)
Firecracker and firectl themselves are maintained of course, but lower level.
I also use cloud-init, and one can spin up VMs with it quickly (5 seconds). I use virt-install, which has command-line options for the cloud-init files and makes everything very easy.
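A sketch of what that looks like (VM name, image filename, and cloud-init file paths are made up; the `--cloud-init` option needs a reasonably recent virt-install):

```shell
# Boot a throwaway VM from a cloud image, seeding it with cloud-init files.
virt-install \
  --name scratch-vm \
  --memory 2048 --vcpus 2 \
  --import \
  --disk ubuntu-22.04-server-cloudimg-amd64.img \
  --osinfo ubuntu22.04 \
  --cloud-init user-data=user-data.yml,meta-data=meta-data.yml \
  --network default \
  --noautoconsole
```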
I'm very interested in the concept, but found Kata to be hard to get going with.
Last time I looked (a few months ago), the documentation was pretty sparse or outdated. A lot of documentation I found stated something like "we broke that in the new version, you can't actually do that right now", like using it with Docker, which I would much prefer over setting up Kubernetes.
I still think it's an absolutely great thing, but the onboarding could have been a lot less rocky.
FWIW, I eventually kind of DIYed it with QEMU MicroVM and virtiofs - never did anything with it though.
On the onboarding point: Almost all projects would do themselves a big service by putting work into the onboarding experience because it's a key factor in how many people will use and help developing the project.
I don't have numbers either, but it's a combination of extreme focus on the boot path and virtio drivers, and traditional containers now being quite heavyweight to start (especially when run via Kubernetes).
The big problem with Katacontainers is not whether or not they are slightly faster or slower than containers, but the fixed memory allocation which means you must first know and then allocate the maximum amount of memory they might ever need up front. This can practically limit the number of Katacontainers you can run to something much smaller than is possible with ordinary containers, since RAM is the constrained resource on most servers.
Nevertheless, with confidential computing coming along, it's likely that at some point in the future many containers will really be VMs, since current CPUs implement confidential computing on top of existing VM primitives (and that's basically necessary due to the way the guest RAM is encrypted). It's likely that any workload that touches PII, finance, health, etc will be required to use confidential computing.
> but the fixed memory allocation which means you must first know and then allocate the maximum amount of memory they might ever need up front.
Yup, that's always been the big reason to use containers for me. Startup time and runtime performance are nice benefits, but the memory usage is the giant win: freeing memory in response to the app's needs, and not needing extra memory for running the various OS parts and pieces.
The down side is, of course, security. But that was always the case with containers.
Something still compiles a WASM binary for the local system. Possibly, being able to optimize the WASM without recompiling it from source might be a win? Not needing separate binaries for ARM and x86 is nice, so it should run on a Mac more easily. Also, it runs on an edge server or in a browser, even on a phone, if you care about that.
I don’t think it will replace Docker files since they let you package up such a wide variety of existing server software and WASM is more limited. But if your software does compile to WASM then maybe you don’t care about that.
I think of WASM more like a plugin format, but I expect there will be a lot of engineering effort put into optimizing it, like happened with V8 for JavaScript. Not all web standards win, but betting against one that’s well-established and has a lot of support seems like a mistake.
WASM targets the WASM runtime's virtual machine (i.e. a JavaScript VM), offering fine-grained isolation compared to virtualizing a whole operating system.
edit: don't shoot the messenger. I was merely highlighting the main difference between native and WebAssembly in the context of the discussion.
For serverless (as in AWS-Lambda-like) I agree. In that use case WASM provides a better security barrier than containers, with faster cold-start time (which is really important for the scaling promise of these services).
For the stuff people run on their Kubernetes clusters I have more mixed expectations. Containers are more universal, but I can totally see a microservice architecture running as a lot of WASM runtimes with a handful of containers.
Particularly in the case of Amazon Lambdas, those are already running in VMs (Firecracker). Why wouldn't you skip the VM and use a static binary from Rust instead?
> The big problem with Katacontainers is not whether or not they are slightly faster or slower than containers, but the fixed memory allocation which means you must first know and then allocate the maximum amount of memory they might ever need up front.
Conversely the problem with containers is that memory allocation including the OS page cache is not guaranteed. That's bad for a lot of applications, especially databases. It seems Docker has some support for shared page cache but it's not in the Kubernetes pod spec as far as I can see. [0] You would probably need some kind of annotations and a specialized controller to make this work.
This isn't a substitute (nor is virtio-mem, the modern equivalent). The problem is the application running in userspace inside the guest cannot request more memory when, for example, it does a mmap or sbrk.
Which application languages and frameworks support this kind of dynamic memory allocation? For predictable performance and throughput we benchmark our Java applications under specific CPU and memory constraints and specific heap and memory settings. How would an app in a container suddenly give back RAM? A garbage-collected application may be able to do that by collecting garbage. Possibly. But others?
No idea about Java, but any C program will request memory using mmap, and may give it back using munmap. This doesn't work when the program is running inside a VM, but does work for containers (which are basically just regular processes).
How does this work in practice? If the application didn't need this memory anymore because it is done with the work/data, shouldn't it already have freed or munmapped it? Is there a signal that can be sent to a process to free up and return memory?
Not driven by the guest application they don't. It's frustrating that I actually work in this area and know what I'm talking about. I worked on Katacontainers back when it was Intel's Clear Containers in the late 2010s. So many people in this thread do not have a clue.
I couldn't understand the comment that "the application running in user space cannot request more memory". Can someone explain what the point of memory ballooning is, anywhere, if an application cannot signal when the system should actually provision physical memory from the 'balloon'?
The systems administrator can use ballooning to give more memory to a VM before launching a new application. This avoids the need to shut down the VM to give it a new role.
There is still a benefit to ballooning support even if it's not exposed to userspace within the VM, because VMs aren't always used purely to host a single infinitely-long-lived application without outside intervention.
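With libvirt, for example, that outside intervention is a one-liner against the balloon device (the domain name here is hypothetical; the guest needs a virtio-balloon device, and the target must stay within the domain's configured maximum memory):

```shell
# Shrink the running guest to 2 GiB (inflate the balloon)...
virsh setmem guest1 2G --live
# ...then give it 6 GiB back before launching the new application.
virsh setmem guest1 6G --live
```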
We had a lot of reports of ksm/ksmtuned consuming a lot of CPU and not making a lot of difference. I think it works well for certain workloads, and can be quite pessimal for others. There are also security concerns because you can leak information about (eg) what glibc is being used by another tenant using timing attacks. So you'd probably want to turn it off if multiple tenants can be using a single node.
The short version is that kernels support guest/host relationships natively so that guests can pass operations directly to the host without having to go through an additional system call. Everywhere you do this is attack surface where an attacker in the guest can communicate with privileged facilities, so you want to minimize this where you can.
There's usually overhead in the places where the communication requires an additional hop. If you want your host filesystem isolated you're going to need a translation layer and it will be slower. If you're willing to open up your host OS's filesystem, you can basically get ~0 overhead.
There is pretty low overhead if you are opinionated. This is very similar to Firecracker (AWS) tooling: a cut-down hypervisor with ~0 devices, plus a cut-down guest OS, means pretty quick boot times.
Depending on your use case there's potentially negligible startup time, on the scale of single-digit seconds down to less than half a second depending on how much work you put into optimizing it. For some applications this will be too slow (mainly the type where you boot a container per request, although flyio seems to make it work), but I think for a _lot_ of applications this wouldn't be noticed.
Kata gives you a few different options for what/how you'd like to boot, including Firecracker.
This isn't exclusive to Firecracker, but if you stay lightweight you can have VMs booting in under half a second if you're using slim images.
I honestly think that for a lot of people, VMs with the convenience/orchestration tools of containers make more sense for a lot of general use cases, simply because of the security benefits. The convenience still needs some work though.
Unless you're dealing with a multi-tenant situation I'm not super convinced that a VM is worth the effort. It's not the perf; it's the need to build your kernel, root filesystem, and the other infra required to make it all work.
Compare that to a docker container where there's basically 0 additional work that has to be done to be up and running.
For most cases I'd be really tempted to work on hardening the docker container than on setting up a VM. Things like Apparmor and seccomp in particular would likely go a very long way.
I wish that someone would put up a honeypot website running various containers holding a root user, where the only goal is to break out of the container.
I suspect that nearly all containerization software is insecure, especially against side-channel and fault attacks (RowHammer, etc.).
In the end, the only way to "prove" container security is to be able to point to the fact that nobody has broken out of it yet. It's remarkable that our entire cloud infrastructure runs on containers that have never been audited by brute force in this manner.
Exploitation is extremely hard. No one has ever exploited RowHammer in the wild, to my knowledge, and there are a lot of reasons why - but even still, RowHammer isn't magic, you can mitigate against it by increasing your refresh rate, or by limiting the attacker's execution time. Not to mention that vendors have deployed numerous methods of reducing the likelihood of an attack and it is quite complex to pull off these days.
I think transparent memory encryption effectively defeats RowHammer attacks.
The CPU has a key it uses to decrypt memory on access, and that key is never known by the operating system or any software running within it. If you use RowHammer to flip bits in the "wrong" memory location, the decrypted values would be random garbage.
Pure hardware protection, no changes to memory chips, almost no runtime performance cost, effective security.
AFAIK RowHammer is already a non-issue on the latest hardware, since the memory module will speculatively refresh rows that may be vulnerable, i.e. if a ton of writes happen to row B then rows A and C will have a refresh triggered too.
Apologies everyone, I've been a little too close to my work lately and jumping to conclusions. A normal person might have phrased this like:
Hey that's an interesting solution you have there, do you think that emulation might help guard against side-channel attacks? Do you have any plans to mitigate them? I wonder if anyone has gamified container security, maybe with a honeypot somewhere where people could try to break out.
> Runs in a dedicated kernel, providing isolation of network, I/O and memory and can utilize hardware-enforced isolation with virtualization VT extensions.
So if it's a dedicated kernel, can this fool game anti-cheat systems into thinking it's not in a VM? Or is it still the same problem?
If a game anti-cheat system can detect that a regular VM is a VM, then it will also detect Kata VM is a VM. In both cases you're running a game in a VM with a dedicated kernel.
Kata VMs are especially VM-y because they use a lot of VM-only features that wouldn't work with real hardware to enhance performance by sharing work between the guest and the host.
Well, it's not that it "detects" it, but most game anti-cheats (e.g. Vanguard or BattlEye) install at the ring 0 layer (the kernel). So on VMs, that doesn't exist (from what I understand). But with this it should be enough to fool it into thinking it's installed in the right space.
> So on VMs, [ring0 layer (the kernel)] doesn’t exist (from what I understand)
The kernel still exists in a VM. The “dedicated kernel” bit of Kata distinguishes it from typical containers, not from VMs.
Kata is an abstraction atop existing hypervisors that are themselves just an abstraction over KVM. There’s nothing new here w.r.t. VM detection evasion.
> fool it into thinking it’s installed in the right space
Most VM detection is about observing devices, drivers, timings, or other side-channel type data that is often only seen in a VM.
AWS already has an implementation of this, namely, Nitro Enclaves. A Nitro Enclave is an isolated VM used for confidential computing purposes. You provide the Enclave an EIF file, which is basically an OCI image with cryptographic measurements. The Nitro hypervisor launches the VM and allows communications with its parent VM only over a Linux vsock connection.
Also, AWS has Fargate, which is basically containers-as-VMs. Every ECS task or EKS Pod is launched as a separate EC2 instance under the hood. This obviates a lot of the need for a solution like Kata there.
Containers are not appropriate for some scenarios. If your threat model is "attacker has arbitrary code execution" you should be very wary of a container holding them for long - you're one kernel privesc away from a container escape. It's nice to have if your situation is "first the attacker would need to exploit my service and then they'd have code exec", but if you provide RCE as a service a normal container is just not a good barrier.
By contrast, running attacker code in Firecracker is much safer - for one thing, the attacker needs to escalate within the VM in order to then expose the necessary primitives to then escape the VM. So it immediately adds an additional required vuln. But also, instead of the entire Linux kernel being the attack surface (it is for the first vuln, but...) you have to attack a much smaller codebase that implements the VM.
In the case of Firecracker this is really hard and, depending on your exploit, if you end up controlling the Firecracker VM itself you're actually in yet another sandbox - so you now need another vuln to escape (although that sandbox is not amazing imo).
Run multiple workloads that you don't fully trust on the same CPU. VM-based containers force the CPU to properly context switch between guests (e.g. via a VMX VM entry on x86); normal process-based containers don't.
(The DAX description quoted at the top of this thread is from https://github.com/kata-containers/kata-containers/tree/main...)