It depends on your definition of "container". The popular Docker-style container is substantially heavier than "just a process"; it creates cgroups, bind-mounts things, creates new pid/user/mount/etc. namespaces, and all kinds of other stuff. Such containers aren't used just for sandboxing; they are also used as a packaging mechanism (install whatever software versions you want) and a deployment mechanism (special networking setups to connect containers), which partially explains why they are so heavy.
Most of that is not really necessary if you are just coming from a standpoint of "I want to sandbox a process". Indeed, if you just want to sandbox a process, just run it under a dedicated non-root user account, and it won't be able to unduly interfere with the rest of the system.
> Indeed, if you just want to sandbox a process, just run it under a dedicated non-root user account, and it won't be able to unduly interfere with the rest of the system.
This is simply not true, unless your definition of "sandboxing" means "run it under another account" and absolutely nothing more -- but under that definition, non-root processes can 100% interfere with the rest of the system unless proper controls are put in place. "Interference" can mean a lot more than "writes to my ~ dir" -- for example, monopolizing a system resource (e.g. IOPS or network bandwidth), adding code to the kernel (module autoloading), simply using a resource at all when it shouldn't, or evading termination of a hostile/buggy set of processes (cgroups and namespaces are among the few reliable ways to kill process groups in a non-racy way).
> Most of that is not really necessary if you are just coming from a standpoint of "I want to sandbox a process".
No, a substantial amount of it is necessary if you actually want to sandbox a process in a way that's remotely isolated from the host system. Use cases like Flatpak and Docker are nice, but merely byproducts of the design that allows custom mount namespaces, etc. The original use cases for namespacing, cgroups, etc. were precisely to give more fine-grained resource and isolation control over existing applications, in a hierarchical manner.
You might be interested in https://github.com/catern/supervise which does provide job-like behavior, using PR_SET_CHILD_SUBREAPER. I haven't written it down in that repo, but I believe that it exits in finite time even with an adversarial scheduler allowing arbitrary amounts of pid-wrap attacks (and it runs fairly efficiently in the absence of an adversary).
Contrary to your article, I think systemd's logic also exits in guaranteed finite time even with adversarial scheduling. You didn't take into account the fact that systemd is (I assume) not reaping zombies when it's doing its kill-loop. Each zombie that is left unreaped occupies a pid. Eventually, pids will be exhausted and processes inside the cgroup will no longer be able to dodge kill() by forking off new processes.
If my frequently given answer teaches anything it at least teaches not to assume that these things work, or how they work. ("systemd is (I assume) ...") Even reading the current source code is not enough. The systemd people have been around the houses several times with this mechanism, changing things and then changing them back again; and there is lots from them to read on the subject, bemoaning it. Do not assume; read.
Your supervise isn't a "superior API", of course. It does a whole bunch of parsing and constructing of human-readable strings at runtime, which is something the 1980s and 1990s taught us not to do. A proper directly-program-usable binary interface actually already exists, as used by Daniel J. Bernstein's original supervise program from daemontools, and it has been in widespread use for two decades with a whole bunch of tools, in toolsets from a variety of people, that speak it. I am one of the people who has documented it in detail. See the manual page for my service-manager command.
You are at about the same level that the supervise in Daniel J. Bernstein's daemontools was in (if memory serves) 1996, before experience taught some very important lessons about the daemonization fallacy.
Not only can Daniel J. Bernstein's daemontools demonstrate some very important lessons, so can the humble ps command (which uses "status" not "stat"). The systemd people (and indeed, again, the humble ps command) can teach how to loop over all of the processes in a system without including process #1 and potentially issuing tens of thousands or even millions of open() and kill() system calls almost all of which are useless and fail. Daniel J. Bernstein's UCSPI and systemd's LISTEN_FDS mechanism can teach ways of telling programs about the inherited open file descriptors that they are to use, that will interoperate with existing tools such as fifo-listen.
And even Stack Overflow can teach the error of assuming that sizeof(bool) is 1. (-:
>If my frequently given answer teaches anything it at least teaches not to assume that these things work, or how they work. ("systemd is (I assume) ...") Even reading the current source code is not enough. The systemd people have been around the houses several times with this mechanism, changing things and then changing them back again; and there is lots from them to read on the subject, bemoaning it. Do not assume; read.
Sorry, what are you even talking about? I pointed out a specific mechanism by which the following assertion of yours:
>A program that forks new processes quickly enough within the cgroup can keep systemd spinning for a long time, in theory indefinitely as long as suitable "weather" prevails, as at each loop iteration there will be one more process to kill. Note that this does not have to be a fork bomb. It only has to fork enough that systemd sees at least one more new process ID in the cgroup every time that it runs its loop.
might be incorrect. (That mechanism is that pid exhaustion will happen eventually, as long as systemd doesn't collect zombies.) Why don't you read the systemd source to confirm that this bug really does exist? Or at the very least, amend your article to admit that you have not done so!
I only know for a fact that my own "supervise" does not have this bug, and since I am using kernel APIs which were originally created by the systemd developers, I would guess that systemd does not have this bug either. But you're the one stating as fact that systemd does have this vulnerability, so I think you have a bit more of the burden of proof! :)
Anyway, regarding your other comments about my supervise utility. Yes, it has a human-readable plain-text interface, so what? It's still got a feature that no other tool has: It allows daemons to fork off their own children, without any risk of bugs causing stray processes to escape supervision, without requiring privileges. I would love to learn of another tool (on Linux) that has this feature, so please tell me! djb's supervise certainly is not capable of this, nor is nosh, nor is any other daemontools derivative that I know of.
>before experience taught some very important lessons about the daemonization fallacy.
What exactly is that fallacy? :) Perhaps you just mean "daemonization is a bad idea", in which case I fully agree?
>The systemd people (and indeed, again, the humble ps command) can teach how to loop over all of the processes in a system without including process #1 and potentially issuing tens of thousands or even millions of open() and kill() system calls almost all of which are useless and fail.
Yes, my supervise utility is currently not very optimized at shutdown, so what? Starting and stopping processes is a rare operation. :) I will optimize it later, when optimization is needed...
>Daniel J. Bernstein's UCSPI and systemd's LISTEN_FDS mechanism can teach ways of telling programs about the inherited open file descriptors that they are to use, that will interoperate with existing tools such as fifo-listen.
A little condescending, don't you think? I am well aware of these tools. They hardcode logic about where to find file descriptors. I prefer the CloudABI argdata style of explicitly passing in the file descriptor number to use, so that conflicts can be avoided. I find that explicitly passing in the fd number allows for more interoperation with existing tools, not less.
>monopolizing a system resource (e.g. IOPS or network bandwidth)
People have been placing ulimits on users to limit their resource usage for decades; using cgroups to do it is not a novel concept, just a more effective one.
>rogue problems like termination of a hostile/stupid set of processes (cgroups and namespaces are one of the few reliable ways to kill process groups in non-racey ways, etc).
The ability of a user to leave stray processes around on the system has been addressed for years now by systemd-logind, which puts each user session in its own cgroup.
>adding code to the kernel (module autoloading)
This is a concern both for container systems and regular user isolation, and the only reliable way to fix it for either is to disable module autoloading, or at least restrict the available modules to well-maintained ones.
>The original use cases for namespacing, cgroups, etc was precisely to give more fine-grained resource and isolation control over existing applications, in a hierarchical manner.
Yes, and they failed. The design of Linux namespaces and cgroups is awful. You can see this in the fact that they require privileges to use. Hopefully in a few years all this Flatpak/Docker churn will produce some namespacing APIs for Linux that are actually good.
You just described containers :)