But doesn't that mean we are merely reimplementing old concepts on a new level? Almost everything we do on VM level we already did on process level:
- Read-only VM images are essentially like statically linked binaries, just better encapsulated when they run.
- Tiny, fast loading VM images are essentially stripped static binaries with dead code elimination.
- Starting those on an incoming network request is just like how old "inetd" setups started a custom handler program for each request (see the sketch after this list).
- Same with CGI, just on HTTP rather than TCP level.
- Maintaining the lifecycle of such a started VM is essentially like FastCGI in Apache (or nginx + an auto-spawner).
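For the inetd comparison: the whole per-request contract is just stdin/stdout. inetd accepts the TCP connection and execs the handler with the socket wired to fd 0/1, so a complete handler is a sketch like this (names illustrative):

    /* one-shot handler: inetd hands us the accepted socket as stdin/stdout */
    #include <stdio.h>

    int main(void) {
        char line[512];
        if (fgets(line, sizeof line, stdin) == NULL)   /* read the "request" */
            return 1;
        printf("you said: %s", line);                  /* write the "response" */
        fflush(stdout);
        return 0;                        /* process exits; the "instance" is gone */
    }

Registered with an /etc/inetd.conf line along the lines of "myservice stream tcp nowait nobody /usr/local/bin/handler handler" (service name hypothetical; it would need an /etc/services entry).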
The only real difference is the improved encapsulation (controlled network access and resource consumption) of the running code, isn't it?
What might come next?
- Multiple VMs could share the same images to save memory, essentially reinventing shared libraries and dynamic linking.
- Network traffic into and out of a VM is controlled more tightly, reinventing SELinux. (Perhaps followed by noticing that these profiles are hard to maintain, rediscovering the maintenance issues of SELinux systems.)
- Providing the network traffic shape directly with the image, so these profiles can be maintained in a single place, by the original developers. This would reinvent OpenBSD's pledge (sketched below).
- Introducing permissions for network traffic, e.g. such that just one group of application VMs can communicate with the database VMs. Ideally with an ACL system as used for managing file permissions. This would reinvent Unix Sockets.
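To make the pledge comparison concrete: on OpenBSD the developer ships the restriction inside the program itself, as a one-line declaration at startup. A minimal sketch (OpenBSD-only API):

    #include <unistd.h>
    #include <err.h>

    int main(void) {
        /* declare up front: only stdio and internet sockets from here on;
           any other class of syscall kills the process */
        if (pledge("stdio inet", NULL) == -1)
            err(1, "pledge");
        /* ... serve requests ... */
        return 0;
    }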
Yes, all this "container per request" or "unikernel per request" stuff is essentially the same as what everyone was doing 20 years ago with basic CGI. But this stuff also has far worse performance than CGI's "process per request", since processes can be started very fast, in microseconds.
It would be more interesting if we just made more sandboxing powers available for traditional processes, though there is already quite a lot of sandboxing available for Linux processes...
It depends on your definition of "container". The popular Docker-style container is substantially heavier than "just a process"; it creates cgroups, bind mounts things, creates new pid/user/mount/etc namespaces, all kinds of stuff. Such containers aren't just used for sandboxing; they are also used as a packaging mechanism (install whatever software versions you want) and deployment mechanism (special networking setups to connect containers) which partially explains why they are so heavy.
Most of that is not really necessary if you are just coming from a standpoint of "I want to sandbox a process". Indeed, if you just want to sandbox a process, just run it under a dedicated non-root user account, and it won't be able to unduly interfere with the rest of the system.
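A minimal sketch of that standpoint (the account name and binary path are hypothetical; the process must start as root to be able to switch accounts):

    #include <grp.h>
    #include <pwd.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        struct passwd *pw = getpwnam("sandboxuser");   /* hypothetical account */
        if (pw == NULL) { fprintf(stderr, "no such user\n"); return 1; }
        /* order matters: supplementary groups, then gid, then uid last --
           after setuid() we no longer have the privilege for the other two */
        if (setgroups(0, NULL) != 0 || setgid(pw->pw_gid) != 0
                || setuid(pw->pw_uid) != 0) {
            perror("drop privileges");
            return 1;
        }
        execl("/usr/local/bin/untrusted-app", "untrusted-app", (char *)NULL);
        perror("execl");
        return 1;
    }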
> Indeed, if you just want to sandbox a process, just run it under a dedicated non-root user account, and it won't be able to unduly interfere with the rest of the system.
This is simply not true, unless your definition of "sandboxing" only means "run it under another account" and absolutely nothing more -- but under that definition, non-root processes can 100% interfere with the rest of the system unless proper controls are put in place. "Interference" can mean a lot more than "writes to my ~ dir" -- for example, monopolizing a system resource (e.g. IOPS or network bandwidth), adding code to the kernel (module autoloading), simply using a resource at all when it shouldn't, or problems like terminating a rogue/hostile set of processes (cgroups and namespaces are one of the few reliable ways to kill process groups in a non-racy way, etc.).
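On the "kill process groups in a non-racy way" point, this is what cgroup v2's cgroup.kill file (Linux 5.14+) gives you: writing 1 to it SIGKILLs every member atomically, with no window for a forking process to escape. A sketch, assuming a v2 cgroup was already created at the path shown:

    #include <stdio.h>

    int main(void) {
        /* path illustrative; /sys/fs/cgroup/myjob must already exist and
           contain the processes to be killed */
        FILE *f = fopen("/sys/fs/cgroup/myjob/cgroup.kill", "w");
        if (f == NULL) { perror("cgroup.kill"); return 1; }
        fputs("1", f);   /* kernel kills the whole group, forkers included */
        return fclose(f) == 0 ? 0 : 1;
    }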
> Most of that is not really necessary if you are just coming from a standpoint of "I want to sandbox a process".
No, a substantial amount of it is necessary if you actually want to sandbox a process in a way that's remotely isolated from the host system. Use cases like Flatpak and Docker are nice, but merely byproducts of the design that allows custom mount namespaces, etc. The original use cases for namespacing, cgroups, etc. were precisely to give more fine-grained resource and isolation control over existing applications, in a hierarchical manner.
You might be interested in https://github.com/catern/supervise which does provide job-like behavior, using PR_SET_CHILD_SUBREAPER. I haven't written it down in that repo, but I believe that it exits in finite time even with an adversarial scheduler allowing arbitrary amounts of pid-wrap attacks (and runs fairly efficiently in the absence of an adversary).
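For readers unfamiliar with it, PR_SET_CHILD_SUBREAPER makes the caller the reaper for all of its orphaned descendants, so grandchildren cannot reparent past it to pid 1. A minimal sketch of just the mechanism (not of the supervise tool itself):

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* from now on, orphaned descendants reparent to us, not to pid 1 */
        if (prctl(PR_SET_CHILD_SUBREAPER, 1) == -1) { perror("prctl"); return 1; }
        if (fork() == 0) {
            if (fork() == 0) { sleep(1); _exit(0); }   /* grandchild */
            _exit(0);   /* child exits at once, orphaning the grandchild onto us */
        }
        pid_t p;
        while ((p = wait(NULL)) > 0)                   /* reaps both */
            printf("reaped %d\n", (int)p);
        return 0;
    }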
Contrary to your article, I think systemd's logic also exits in guaranteed finite time even with adversarial scheduling. You didn't take into account the fact that systemd is (I assume) not reaping zombies when it's doing its kill-loop. Each zombie that is left unreaped occupies a pid. Eventually, pids will be exhausted and processes inside the cgroup will no longer be able to dodge kill() by forking off new processes.
If my frequently given answer teaches anything it at least teaches not to assume that these things work, or how they work. ("systemd is (I assume) ...") Even reading the current source code is not enough. The systemd people have been around the houses several times with this mechanism, changing things and then changing them back again; and there is lots from them to read on the subject, bemoaning it. Do not assume; read.
Your supervise isn't a "superior API", of course. It does a whole bunch of parsing and constructing of human-readable strings at runtime, which is something that the 1980s and 1990s taught us better than to do. A proper directly-program-usable binary interface actually already exists, as used by Daniel J. Bernstein's original supervise program from daemontools, and has been in widespread use for two decades with a whole bunch of tools, in toolsets from a variety of people, that speak it. I am one of the people who has documented it in detail. See the manual page for my service-manager command.
You are at about the same level that the supervise in Daniel J. Bernstein's daemontools was in (if memory serves) 1996, before experience taught some very important lessons about the daemonization fallacy.
Not only can Daniel J. Bernstein's daemontools demonstrate some very important lessons, so can the humble ps command (which uses "status" not "stat"). The systemd people (and indeed, again, the humble ps command) can teach how to loop over all of the processes in a system without including process #1 and potentially issuing tens of thousands or even millions of open() and kill() system calls almost all of which are useless and fail. Daniel J. Bernstein's UCSPI and systemd's LISTEN_FDS mechanism can teach ways of telling programs about the inherited open file descriptors that they are to use, that will interoperate with existing tools such as fifo-listen.
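The technique being alluded to is a single /proc scan rather than probing the whole pid space: walk the numeric directory entries, skip pid 1 and yourself, and signal only pids that actually exist. A sketch:

    #include <ctype.h>
    #include <dirent.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        DIR *d = opendir("/proc");
        if (d == NULL) { perror("/proc"); return 1; }
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (!isdigit((unsigned char)e->d_name[0]))
                continue;                          /* not a pid entry */
            pid_t pid = (pid_t)atol(e->d_name);
            if (pid == 1 || pid == getpid())
                continue;                  /* never signal init or ourselves */
            kill(pid, SIGTERM);            /* one syscall per live process */
        }
        closedir(d);
        return 0;
    }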
And even Stack Overflow can teach the error of assuming that sizeof(bool) is 1. (-:
>If my frequently given answer teaches anything it at least teaches not to assume that these things work, or how they work. ("systemd is (I assume) ...") Even reading the current source code is not enough. The systemd people have been around the houses several times with this mechanism, changing things and then changing them back again; and there is lots from them to read on the subject, bemoaning it. Do not assume; read.
Sorry, what are you even talking about? I pointed out a specific mechanism by which the following assertion made by you:
>A program that forks new processes quickly enough within the cgroup can keep systemd spinning for a long time, in theory indefinitely as long as suitable "weather" prevails, as at each loop iteration there will be one more process to kill. Note that this does not have to be a fork bomb. It only has to fork enough that systemd sees at least one more new process ID in the cgroup every time that it runs its loop.
might be incorrect. (That mechanism is that pid exhaustion will happen eventually, as long as systemd doesn't collect zombies.) Why don't you read the systemd source to confirm that this bug really does exist? Or at the very least, amend your article to admit that you have not done so!
I only know for a fact that my own "supervise" does not have this bug, and since I am using kernel APIs which were originally created by the systemd developers, I would guess that systemd does not have this bug either. But you're the one stating as fact that systemd does have this vulnerability, so I think you have a bit more of the burden of proof! :)
Anyway, regarding your other comments about my supervise utility. Yes, it has a human-readable plain-text interface, so what? It's still got a feature that no other tool has: It allows daemons to fork off their own children, without any risk of bugs causing stray processes to escape supervision, without requiring privileges. I would love to learn of another tool (on Linux) that has this feature, so please tell me! djb's supervise certainly is not capable of this, nor is nosh, nor is any other daemontools derivative that I know of.
>before experience taught some very important lessons about the daemonization fallacy.
What exactly is that fallacy? :) Perhaps you just mean "daemonization is a bad idea", in which case I fully agree?
>The systemd people (and indeed, again, the humble ps command) can teach how to loop over all of the processes in a system without including process #1 and potentially issuing tens of thousands or even millions of open() and kill() system calls almost all of which are useless and fail.
Yes, my supervise utility is currently not very optimized at shutdown, so what? Starting and stopping processes is a rare operation. :) I will optimize it later, when optimization is needed...
>Daniel J. Bernstein's UCSPI and systemd's LISTEN_FDS mechanism can teach ways of telling programs about the inherited open file descriptors that they are to use, that will interoperate with existing tools such as fifo-listen.
A little condescending, don't you think? I am well aware of these tools. They hardcode logic about where to find file descriptors. I prefer the CloudABI argdata style, explicitly passing in the file descriptor number to use, so that conflicts can be avoided. I find that explicitly passing in the fd number allows for more interoperation with existing tools, not less.
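For reference, the LISTEN_FDS convention under discussion is tiny: the spawner sets LISTEN_PID to the child's pid and LISTEN_FDS to a count, and the descriptors start at the fixed fd 3 -- which is exactly the hardcoded "where to find file descriptors" logic objected to above. The consuming side, sketched:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define LISTEN_FDS_START 3   /* the convention's fixed first fd */

    int main(void) {
        const char *pid_s = getenv("LISTEN_PID");
        const char *fds_s = getenv("LISTEN_FDS");
        if (pid_s == NULL || fds_s == NULL || atol(pid_s) != (long)getpid()) {
            fprintf(stderr, "no sockets were passed to us\n");
            return 1;
        }
        int n = atoi(fds_s);
        for (int fd = LISTEN_FDS_START; fd < LISTEN_FDS_START + n; fd++) {
            /* each fd is an already-bound listening socket; accept() on it */
            printf("inherited listening fd %d\n", fd);
        }
        return 0;
    }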
>monopolizing a system resource (e.g. IOPS or network bandwidth)
People have been placing ulimits on users to limit their resource usage for decades; using cgroups to do it is not a novel concept, just more effective.
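The mechanism in question is setrlimit(2), e.g. capping open files and CPU seconds before exec'ing the workload (values illustrative). The caps are per-process and inherited, which is also why they cannot express aggregate limits like IOPS across a whole group the way cgroups can:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void) {
        struct rlimit files = { 64, 64 };   /* soft, hard: max open fds */
        struct rlimit cpu   = { 10, 10 };   /* soft, hard: CPU seconds */
        if (setrlimit(RLIMIT_NOFILE, &files) != 0
                || setrlimit(RLIMIT_CPU, &cpu) != 0) {
            perror("setrlimit");
            return 1;
        }
        /* exec the constrained workload here; children inherit the limits */
        return 0;
    }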
>problems like terminating a rogue/hostile set of processes (cgroups and namespaces are one of the few reliable ways to kill process groups in a non-racy way, etc.)
The problem of users leaving stray processes around on the system has been fixed for years now, with systemd-logind, which puts each user in its own cgroup.
>adding code to the kernel (module autoloading)
This is a concern both for container systems and regular user isolation, and the only reliable way to fix it for either is to disable module autoloading, or at least restrict the available modules to well-maintained ones.
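The blunt, reliable knob for the stronger variant of this is the kernel's one-way modules_disabled sysctl: once set, no module can be loaded, automatically or otherwise, until reboot. A sketch (needs root, and it is irreversible for the machine's uptime):

    #include <stdio.h>

    int main(void) {
        /* one-way switch: cannot be set back to 0 without a reboot */
        FILE *f = fopen("/proc/sys/kernel/modules_disabled", "w");
        if (f == NULL) { perror("modules_disabled"); return 1; }
        fputs("1", f);
        return fclose(f) == 0 ? 0 : 1;
    }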
>The original use cases for namespacing, cgroups, etc. were precisely to give more fine-grained resource and isolation control over existing applications, in a hierarchical manner.
Yes, and they failed. The design of Linux namespaces and cgroups is awful. You can see this from the fact that they require privileges to use. Hopefully in a few years all this Flatpak/Docker churn will produce some namespacing APIs for Linux that are actually good.
It's all part of the endless fight against proprietary systems. When cloud providers started using 'optimized' images the response was to move everything above the VM to 'container' level. The same thing happened in the VM boom once hardware makers started adding binary blob drivers to bare metal.
We'll just keep building layers as manufacturers try to lock the lower ones down :)
Basically, the cost of spinning up VMs is coming down drastically (with Hyper.sh and Intel ClearContainers offering 200ms boot) and it is now feasible to run your apps in a personal cloud sandbox. Your data resides in that box and runs code you have approved; away from the prying eyes of FB, Google and the government. Everyone's facebook will be an individually hosted app, talking to each other over agreed-upon protocols, as opposed to exchanging data within the facebook infra.
Of course, a lot of UX and engineering problems need to be solved along the way. Provisioning a personal sandbox cannot be any harder than creating an account on a social network. Bieber's sandbox will need to distribute data to 105M followers efficiently. You'll need to patch vulnerabilities at scale, so there needs to be some trust model for applying patches. You need relays to overcome censorship, etc.
What you - and the personal facebook case - are really describing is IndieWeb, an already-developing movement of home-building your social media and linking with others by implementing open protocols.
The main issue just now is that writing (or deploying) IndieWeb platforms and hosting them is the preserve of geeks, and it will not take off and displace current platforms until we can make it trivial for Mom and Dad to join us.
Sounds like https://urbit.org. Their vision is to build a personal cloud atop existing infrastructure where apps/data are all user owned, decentralized, among other things (globally referentially transparent file system, auth/identity services provided by the network, guaranteed message delivery, etc.).
Wouldn't this be good for shutting down unused VMs in order to save money when renting infrastructure? Most websites could go "down" during the night, and when a user tries to connect to one, it would take 0.2s extra, but after that it would stay on.
Now that I think of it, spiders/bots might activate it beforehand. You could use robots.txt, but not all spiders/bots seem to follow those "rules".
Heroku actually does this for their free instances. They "sleep" servers after iirc 30 minutes of inactivity, and they wake up again after the first request, which takes a few seconds longer as it boots up.
What does a website really do if it does nothing? What money does it consume? What else can you do with the machine it's on? Wouldn't moving VMs around accomplish the same goal?
It's definitely impressive, kinda like a home-grown serverless implementation.
However, is there a point in shutting down the VM right away, other than being a cool demo? It seems to me that you could quite easily keep it around if there are more requests queued, and quickly spawn more workers if the queue gets too long. So a regular worker pool, just that the number of workers can be adjusted relatively quickly.
Shutting them down right away achieves a number of things, such as significantly better isolation between requests, and more protection against resource leaks.
Whether that'll be worth the cost remains to be seen - 250ms is still far too long for most uses, but if it can come down enough, it's attractive.
I couldn't get to the site, but assuming this is running Linux I wrote a paper about how fast you could boot a Linux minimal userspace on QEMU a couple of years ago[1]. In short if you're willing to heavily customize the kernel, qemu and userspace then you can get it down to 100-150ms on i7-type hardware, but probably not much lower. Using a distro kernel, distro qemu and so on, you're stuck with 400-600ms (which is fine for certain uses, eg better security for containers like Kata Containers is doing).
Also worth noting I wrote a bunch of benchmarking tools to help optimize this case[2].
Thanks for the links. I've looked at Clear Containers/Hyper/Kata before, and it's absolutely looking good. It's very interesting to see your work going into the details of it, though.
I really hope this can be pushed further down to the point where it becomes viable for more use cases. Of course, we can always use something lighter than Linux, but being able to start with Linux and pare down is potentially so much easier for many types of applications.
It's similar in intention, but "serverless is just `cgi-bin`" misses a lot of nuance. Zerg and cgi-bin are solving different problems and with different implementations. It's extremely difficult, if not impossible in some environments, to get the same kind of isolation with cgi-bin that you do out of a VM.
Is it? I'm under the impression that cgi-bin is a way to serve requests from the local computer, while Zerg is a way to spawn _other_ server instances to handle the request (and in Erlang, specifically).
Yes, but you could let a load balancer handle that for you, and you might not need as many computers to handle the requests if it didn't take 250ms to start.
Booting a VM should be "as complex" as fork. And the memory image of the booted OS should be identical on each load before the end user specific data is loaded.
"Prefork" these VMs by spraying that image over physical memory in advance, jump to them on a request. memcopy a new VM over them on exit.
VM spawn rate should be the memcopy rate; if the MMU is powerful enough, the copy should only be the amount of memory mutated during a request.
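A process-level analogy of that idea, sketched with fork(): pay the spawn cost before the request arrives by keeping a pool of pre-created workers, let each serve exactly one request and exit (discarding all mutated state), and replace it with a fresh copy-on-write image. The handler body is a stand-in:

    #include <sys/wait.h>
    #include <unistd.h>

    #define POOL 8

    static void handle_one_request(void) {
        sleep(1);   /* stand-in for: block for one request and serve it */
    }

    static void spawn_worker(void) {
        if (fork() == 0) {            /* the "memcopy": a COW copy of our image */
            handle_one_request();     /* serve exactly one request */
            _exit(0);                 /* discard all state mutated by the request */
        }
    }

    int main(void) {
        for (int i = 0; i < POOL; i++)   /* "spray" the pool in advance */
            spawn_worker();
        for (;;) {
            wait(NULL);                  /* a worker finished... */
            spawn_worker();              /* ...replace it with a fresh image */
        }
    }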
I have written a few web services in the past where this kind of isolation would have been great. Services doing file conversion of untrusted input with software that isn't hardened for that case (in my case converting PDF to png and postscript).
It seems I managed to get through right before it went down. Sort of a neat idea. But is it really worth starting up a new VM for each request? That seems to be the majority (380 ms?) of the time spent.
Slightly off-topic: the page lists the time for an Android phone to boot as 50 seconds. Is that correct? I remember iPhones used to take forever, but I recently realized that they seem to boot extremely fast. I have no idea when that change happened.
Just tested my Pixel 1. It takes 24 seconds from button press to lock screen, and another 18 seconds for me to type my password and display everything on the home screen, but I'm a slow mobile typer and my password is 16 characters.
Just tested a Pixel 2 XL. 12 seconds to lock screen and 4 more to the home screen including my pattern entry. I think this is the fastest booting phone I've ever had.
Actually I had to do it twice because the first time it installed a system update, but even then it only took 26 seconds to install, finish booting, and show the lock screen. System updates have been seriously optimized in the last few Android releases. They used to take like 5 minutes.
> System updates have been seriously optimized in the last few Android releases. They used to take like 5 minutes.
On the Pixel phones Google uses two system partitions. One is the one you're currently running and the other one gets the update applied to it in the background without you having to stop using your phone. When you reboot it just has to flip which system partition it boots from and maybe do some first boot actions. If the phone starts up successfully it'll then apply the update to the original partition to keep them in sync.
Thanks. I wonder if that’s for some random $50 low end phone.
My iPhone 8 was 20s to the login screen. It’s basically instant once I log in, but that may be because my password takes a while to type and it’s loading the rest while I type.
That makes sense, but even if everything else takes three seconds (that’s overwhelming the small VM boot time) I’m not sure what the real benefit of this is.
I’m sure it’s technically more secure than a sandboxed application running in its own process, but by how much?
It's an implementation of the Erlang VM on top of Xen. Xen is a virtualization hypervisor. Erlang VM is a virtual machine which runs Erlang bytecode. The default implementation is called BEAM and it runs on *nix, Windows and other architectures. This implementation is more lightweight and only runs on top of Xen. One of its main features is a fast boot time.
So to recap instead of the typical stack of:
Erlang Code | BEAM VM | Linux or other OS | VM or Hardware
You get:
Erlang Code | Ling | Xen | Hardware
The use cases could be isolation, scaling, or what the author calls the 0-footprint cloud. That is, you only spawn instances as you process requests. When requests are not coming in, you don't have any instances running. In more modern terms, it could perhaps be used for building a serverless architecture used to run cloud "functions".
Yes, there is obvious irony here with the demo not handling requests well after the HN front-page hug of death, but I think it is still a beta/experimental product; it has been around for a good number of years and is somewhat abandoned. If anyone knows more, please feel free to correct me.
Actually, we had a POSIX port of Ling of alpha quality (e.g. no access to file system, a prototype of network connections based on libuv, etc) and experimented with running it on MIPS microcontrollers natively (not a great idea due to memory constraints and lack of drivers, but it could do simple computations).
You might want to look at OSv[0], an extremely lightweight OS for running a singular JVM process on cloud vms. There's a decent list of unikernels (including OSv and LING) here[1].
It's a unikernel of sorts called Erlang on Xen (the VM implementation they wrote is called Ling). It does not use the BEAM runtime so you'll likely need to recompile the Erlang code, but otherwise Erlang is a simple language, so I expect most things would work.
Given that this was written roughly 4 years ago (guessing by some dates on the site), it's not clear what their support for newer features like maps might look like.
Performance looks good (relative to BEAM) but they are also single-threaded which means you'd need to cluster your instances to do multi-processing... and I'm not sure that'd mix well with ephemeral nodes. They talk a little more about this on their site: http://erlangonxen.org/more/clustering
I always thought ZeroVM was a cool idea in this space. The NaCl runtime makes a lot of sense as a base for servers that can take advantage of existing libraries.
After reading about ZeroVM I was incredibly excited - but then saw the last Git commit was in January 2015 =[
Is it considered "stable" or is it simply abandoned? Are there any successors? Near zero overhead VM acting on data storage is something I'll need for a planned project of my own.
It is abandoned; Rackspace laid the team off. NaCl is still stable, although the team is working on WebAssembly now, which is a replacement (but less mature). ZeroVM does still work but is a pain to build.
Didn't Rackspace acquire ZeroVM during its 2010-2013 SaaS buying spree? I'd anticipated seeing ZeroVM as some part of their "cloud" offering, but it looks like it was rolled into OpenStack Swift. Has Cloud Files performance improved at all in the past 4 years, and how much of this can be attributed to ZeroVM? (A lot of ignorance in this question, I know.)
This is the future of computing/software -- a pure commodity of CPU+memory. Applications that build on top of these granular interfaces will largely define the next wave of successful architectures.
I'm thinking of building something here for static sites as an open source project/spec. If anyone is interested in bouncing ideas on this, please ping me via email (in profile).
I agree with your first paragraph. But why would this provide any benefit for a static site where no compute is needed (or all compute is client side)?
My bad -- I meant static sites with one additional functionality that all static sites try to outsource: comments.
(it would no longer be strictly static, but the idea is to decouple site from hosting/compute, so it would be easy to port a static/dynamic site between providers at the switch of a DNS entry).
Yes, Zerg is powered by Ling. If you scroll down you can see how it's involved in the request summary. A cached copy of the page is in a sibling comment
> The demo is limited to 16 concurrent instances and 2 libvirt connections. Due to these limitations we were unable to spawn a new instance to service your request. Please try again later.
Not really; those are exactly the sort of config settings you'd sensibly use to prevent someone from DDoSing your untrusted prototype-level infrastructure.
If your VMs are truly cheap, and you've ironed out the kinks from your proof-of-concept, you'd obviously set up your production system to run more than 16 of them.
Right, and who decides whether a request is coming from a customer or an attacker? If the attacker can generate enough requests, none of the customers have a chance to use your service. This is absolutely not the way to handle DDoS.
My point was that it's a sensible strategy for the demo of a technology that's still under development (like this project), where you don't actually have any "customers." In production, you'd let it scale unboundedly, and combine it with some sort of anti-DDoS technology like Cloudflare.
The point is to show boot times. If you are scaling up/down in response to load, this is a nice quality to have. Most folks have really spiky traffic patterns so this can save you a lot of money.
Obviously no one would ever run something like this 1 vm per request thing irl.
> Obviously no one would ever run something like this 1 vm per request thing irl.
I can see plenty of use-cases for doing just that. Large uploads, time-consuming request/responses such as server-side data processing, RPC, as a backend behind a caching front-end so that it only has to respond to invalidated cache entries, etc.
I don't see many people using this to actually serve general website requests though. It'd probably be modified to serve multiple requests until nothing is left to do and then exit.
It depends on how many resources it takes to run those 16 instances. If it's an extremely small demo allocation, then that's not a big deal. If it's multiple gigabytes and cores then the efficiency is awful.