Ansible vs. Chef (2015) (tjheeta.github.io)
170 points by fanf2 on March 9, 2016 | 99 comments


Ansible has scaled really poorly to thousands of hosts for us. Things we have run into:

- Running a job against a single host will finish in 3 minutes... running that exact same job against thousands will take well over an hour and max out your machine.

- Running against more than around 3k hosts will somehow consume all 60GB of RAM and trigger the oom-killer

- CPU usage on the ansible runner is absurd for a large amount of hosts. We're currently using a c4.8xlarge (our biggest box) just to run deploy jobs and have them finish in a reasonable amount of time (10-15 minutes)

Slicing up our inventory into chunks and running them on different servers sucks big time and is pretty hacky. How do I combine the results? Can't do orchestration like "Run X on these roles first, then run Y on these roles when you're done".

Most likely what I'm going to do is have a single server execute Ansible, doing only the following in async (i.e., CPU-friendly) mode (a rough sketch follows the list):

- Upload a current copy of ansible to S3

- Upload the configs to the target machines with ONLY the secrets that role needs in plain text. (I'm not putting my vault secret on every box!)

- Have the servers pull it down and execute in --connection=local mode.

- Wait until each remote finishes
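
Roughly, the kick-off play could look like this. This is a sketch, not our actual config: the bucket name and paths are invented, but the async/poll-0 fire-and-forget plus async_status polling is standard Ansible:

    - hosts: all
      gather_facts: no
      tasks:
        - name: Pull the bundle from S3 and run it locally (fire and forget)
          shell: |
            aws s3 cp s3://example-bucket/ansible-bundle.tgz /tmp/bundle.tgz
            mkdir -p /tmp/bundle && tar xzf /tmp/bundle.tgz -C /tmp/bundle
            ansible-playbook -i 'localhost,' -c local /tmp/bundle/site.yml
          async: 3600        # give each host up to an hour
          poll: 0            # don't hold an SSH session open while it runs
          register: job

        - name: Wait until each remote finishes
          async_status:
            jid: "{{ job.ansible_job_id }}"
          register: result
          until: result.finished
          retries: 120
          delay: 30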

All that said, I LOVE LOVE writing stuff in Ansible. It is so easy to read, follow, and understand. I picked up most of it in a day or two just by reading their "Best Practices" page. Getting it to work at scale hurts though :(


> It is so easy to read, follow, and understand

I never liked that the variable namespace is global, so there isn't any way for a module to be self-contained. If you execute Module 1 and then Module 2, Module 1 can set a variable that inadvertently affects Module 2. The "recommended" way around this is to prefix all variable names, but this becomes unwieldy very quickly as your variable names grow in size. The "nicer" way would be to have a dict/hash of variables, but that makes top-level overrides difficult: there is no way to override "hash_name.variable_name", so you basically have to override the whole "hash_name" variable or nothing at all.
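
(A made-up illustration of the trade-off, using a hypothetical redis role:)

    # roles/redis/defaults/main.yml -- unprefixed names pollute the global namespace
    port: 6379                 # collides with any other role that defines "port"

    # the "recommended" workaround: prefix everything
    redis_port: 6379
    redis_bind_address: 127.0.0.1

    # the "nicer" dict form -- but you can't override just redis.port from
    # the inventory; you have to redefine the entire "redis" dict
    redis:
      port: 6379
      bind_address: 127.0.0.1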

I found it difficult sometimes to reason about how these variables would work, and if I needed to add something to defaults.yml or variables.yml in a module.


Agreed. In fact, variables are scoped by "play", not by "role", which is almost weirder. For those who don't know, the organizational structure from largest to smallest is playbook -> plays -> roles -> modules/tasks. Variable scope is at the "play" level.

More than that, roles are meant to be distributable components on the Ansible Galaxy service. Galaxy gets almost no use because modularity and reusability are broken: you have no idea how a role's author namespaced their variables, so collisions happen all the time. Why we must manually manage scope with naming conventions when the computer could do it automatically is beyond me.

This is Ansible's biggest downside in my opinion. I've talked with the core devs about it on IRC and they (bcoca) agreed but thought it too late to make such a pervasive change as introducing role-level scope.

As some of the other posts have mentioned, I still love Ansible despite this shortcoming.


Can't agree enough here. The core devs act like it isn't a problem. The lack of encapsulation makes reusability impossible.


I remember being really down on Ansible Galaxy, and whenever someone tried to use a community role, I'd pull it down, audit it, and make them fork it, because it was inevitably dangerous, poorly thought out, and untested. Now I'm back in a Chef shop, and Chef has a ton of tooling for re-usability, has put loads of thought and effort into the problem, and there are tons of cookbooks, many maintained by Chef, Inc. The problem isn't really any better though: the official and officially blessed cookbooks are still terrible, broken and unsafe in all sorts of obvious ways. I'd rant more, but I'd have to get specific and mean about actual people.

It's the problem space, honestly. It all depends on the guardrails your workflow provides, and there's never enough in common.

Anyway, there's a reason everyone loves golden-images.


Good point. The Ansible community members who happened to be talking to me on IRC about this basically said that, at least at this point, you have to look at all your role code no matter what anyway, meaning scope wasn't the core issue. But the lack of variable scope hurts even my own Ansible config abstractions.


> Running a job against a single host will finish in 3 minutes... running that exact same job against thousands will take well over an hour and max out your machine.

Have you used the variety of options to throttle this down? Running against 3,000 hosts at once is, in my opinion, absolutely crazy. And almost a contradiction to:

> Slicing up our inventory into chunks and running them on different servers sucks big time and is pretty hacky. How do I combine the results? ...

How do YOU combine them all? All results from all 3,000 hosts? And it's not "hacky" at all; it's what you should be doing as a systems engineer/architect in the first place. No one network should be 3,000 hosts in size in a flat structure; that should be partitioned into logical arrangements for easier management (which, by your own points, you currently don't have), not to mention security.

As for combining the results: have you considered writing a simple module for Ansible which sits locally and is called after every task completes? It's easy to work with, and you can push the results into ElasticSearch.

Ansible isn't a magic bullet, it's a tool on which to build, so build on it :-)


> - Running against more than around 3k hosts will somehow consume all 60GB of RAM and trigger the oom-killer

Have you looked at Salt (salt stack)?


If you're using yaml and templates then you've already lost. The only tool in this game that is not braindead is chef. Sometimes you need imperative things and conditional logic with iteration. If you don't have a real programming language, the contortions you have to go through get really old really fast.

As for the deployment patterns: if you're in the cloud then you should be baking AMIs (or the equivalent in your cloud provider) and shipping your configuration the same way you ship your application code, as native packages like .deb or .rpm. If you jumped on the Docker bandwagon then your hosts are basically there to look pretty and host the containers, which means you have some other way of getting configuration to your servers (e.g. etcd, consul), so the problems brought up in this post don't exist in that setting. You are also probably using some kind of container orchestration system like Kubernetes, so again the problem of orchestration and deployment is offloaded to some other system. The only problem you have in that setting is doing a rolling deploy of containers and halting when things go wrong.

I think the only place any of these tools make sense now is some private on-premise cloud. Every other place has already moved on.


Yes! Please, for the love of all that is holy, please quit writing tools that make me write code in something that isn't a general purpose programming language! Didn't we learn from Ant and xml?

I really want to see a clojure/clojurescript based config management system -- it would be so pleasant to write EDN/sexps for basic config, and yet have it be a real language when you need to do something hard.

Edit:

Forgot to mention, pallet [1] is something like that, but unfortunately it appears to be mostly dead.

[1]: http://palletops.com/


>Yes! Please, for the love of all that is holy, please quit writing tools that make me write code in something that isn't a general purpose programming language! Didn't we learn from Ant and xml?

Yes, we learned to use less powerful languages where it was appropriate because they're more readable and less susceptible to technical debt.

This principle, in other words : https://en.wikipedia.org/wiki/Rule_of_least_power

Ant XML was as powerful as Java: it was Turing complete and terribly designed to boot. That was its primary failing.

Likewise, using Turing-complete PHP to generate HTML was never as clean as using a less powerful templating language (like Jinja2) to generate the HTML. Separation of concerns with a language barrier is a good thing.

If all of this means nothing to you, you've probably created some huge code messes in your time.


> Likewise, using Turing-complete PHP to generate HTML was never as clean as using a less powerful templating language (like Jinja2) to generate the HTML.

I strongly agree with your point, but Jinja2 is Turing complete as well (it's still preferable to PHP though).


I don't really know enough computer science to validate this idea, but I can sense that there are different levels of "power" among Turing-complete languages (and also among non-Turing-complete ones). And Jinja2 < Python/PHP, despite all three being Turing complete.

Metaprogramming / C++ style templating, for instance, goes above and beyond the power provided by regular turing complete programming constructs, and while that means that you can do cool stuff with them you couldn't easily do otherwise, they're a massive headache to reason about, debug, and keep free from technical debt.

Similarly, when you take blocks of code and "lower the power" to make it declarative instead of imperative (e.g. using list comprehensions instead of for loops) it almost inevitably ends up cleaner.


We're using pallet. Even though it's not trivial to get started with and the docs are quite sparse, it provides a solid foundation for scripting configuration management tasks. I much prefer it to Ansible/Chef/Puppet. However, as you're exposed to the full programming language, there are far fewer constraints on what you can do. You'll need a layer of rules (a framework) of your own if you don't want to end up in a mess.

Activity of the core project seems to be low, but that's true for many other mature Clojure projects. You can still use them successfully.


The only tool I've used that used a native language was Buildbot, from which I still have the scars. Not because of the native language, but from the versioned documentation that didn't match the behaviour of the given version, and the complex workflow that had zero examples because "everyone's setup is different": you had to figure out their words for terms and build your config from scratch. It was an entirely unpleasant experience.


We also use pallet. Harder to grok in the beginning since you have to learn the lib but it's pretty amazing what you can do with a full language.

If anyone is frustrated by the speed of jclouds (which pallet uses internally), I recommend trying pallet's Amazon-specific driver; it's much faster.


This seems to be a common problem among us engineers: that we are certain we know the Correct Tool, and everyone else is either stupid, ignorant, or misled. Or all three.

Interestingly, I've found that a focus on solving the problem at hand (say, installing a package on a hundred systems quickly and predictably) rather than a focus on using the Absolute Best Tool is much more likely to lead to a pleasant work experience for everyone involved.

That, and "the Best" is frequently subjective and, even when accurate, tends to get replaced in a few weeks by something else.


I'm ok with limitations when the limitations actually buy me something. I've used all the tools in the space and they all manage to miss the mark. If I'm giving up programmability then I'd better get something in exchange.

If that something is templates that I can't reason about when the right context is not available, then that's not solving any problem. First and foremost, whatever I use must be something I can reason about and debug. Strings concatenated with other strings based on some weird rules are anything but reasonable and debuggable.


I'm pro-baking AMIs. I do it every day and have automated and wrapped it up with Jenkins/Packer etc. I don't get your point though. Something (Ansible, Puppet, whatever) has to put all the config in place. I've done the whole rpm-for-everything approach before too. It didn't end that well, to be honest.


There may be a scale of braindeadness, but Chef still scores pretty high on it. For example, the Chef server cannot tell the difference between a new node and an update of node state (and that's enforced by the API design, not a specific implementation). Just think of all the weird things that can happen this way in a very dynamic environment.

(example: if you made some mistake while provisioning new hosts, you'll have one node data overriding another, without any notification)


How do you manage that, exactly? Using the same client certificates? Client names are unique, so you can't have duplicate entries for the supposedly same node.


When cloning machines, for example, you can race to replace the credentials. Also, when deleting nodes you need to make sure you first delete the client, then the node; otherwise the node can get silently recreated and mess up your discovery. Also when... (quite a few possibilities).


Ah, yes. It's somewhat annoying that there isn't a clean way to remove both client and node in one go. But when it comes to cloning the images should be prepped before cloning. Or just don't clone.


> The only tool in this game that is not braindead is chef.

Yeah, right... https://docs.saltstack.com/en/latest/ref/renderers/all/


So not only do I get to write yaml or some other template nonsense but I get to choose the dialect as well. Yes, much better.


It's clear you're a developer and not a systems engineer.


I like to think I'm a generic problem solver that through accidents of history has ended up using software instead of pencil and paper to solve problems. In another life pencil and paper was adequate. Systems, programming, administration is all the same to me. I don't discriminate. I also like to use actual programming languages instead of their yaml based bastardizations.


"I also like to use actual programming languages instead of their yaml based bastardizations."

I think the problem here is confusion, and I see it a lot in the Ansible community.

What you're doing, and I can completely understand why, is taking Ansible as being a programming language for defining state. That isn't what Ansible is designed to be.

That YAML you speak of is designed to allow you to define state in a format that is not only human readable, but also machine readable. All that YAML is meant to do is say x = y. It's not trying to be a programming language or a scripting interface; it's simply a means of setting the state you want.
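
(For example, a task just declares the desired state; how it's reached is the module's job. A trivial sketch:)

    - hosts: webservers
      tasks:
        - name: nginx is installed          # state, not instructions
          package:
            name: nginx
            state: present
        - name: nginx is running and enabled
          service:
            name: nginx
            state: started
            enabled: yes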

This is a common pitfall a lot of people fall into when they approach Ansible, and I'm always seeing questions on IRC like: "How do I code Ansible to pull down a file, read it, process it, and after doing some logic, do X?" The answer is: you're using Ansible wrong. Use it as a state management system and only a state management system, and you won't go wrong.

Use a scripting or programming language, write a script/program, and run that on the remote machine if you need highly complex logic to determine state.


That. Indeed, there's a whole different mindset for people deploying and orchestrating servers based on their 'previous' background.

On the previous iteration of CM tools, I noticed that people with a prior background in development usually chose Chef, while the sysadmins chose Puppet.


> If you're using yaml and templates then you've already lost.

Probably true.

> The only tool in this game that is not braindead is chef.

What's wrong with s-expressions? Code and data at the same time.


>What's wrong with s-expressions? Code and data at the same time.

Sure, but you need something that will actually evaluate that sexp. May I suggest GNU Guix to fill that role?

http://www.gnu.org/software/guix/


I really like Guix's ideas, but I think that project made a very unwise choice when it picked Scheme rather than Common Lisp.


I have done several small deployments (under two dozen servers) with Chef, Puppet, and Ansible. I've also evaluated Salt and worked with other people's Bconfig systems. Ansible is the best for my situation, hands down. I no longer have any of the others in use.

Ansible is easy to reason about - it's never surprised me once in use. You have about an order of magnitude less to learn when compared with chef or bconfig.

Also, for setups with small target VMs, it's incredibly handy not to have to install a bunch of stuff on each server and make sure it doesn't conflict with anything else.

But mostly it's that Ansible can be understood enough without devoting a couple weeks of your life to it. And you can come back later and understand what you have written.


> it works reasonably well on the scale of thousands of hosts

I could see this if you're working from one really powerful machine... no, that won't work, it's constrained by SSH, not hardware specs.

I could see this if you're calling Ansible on another host... no, then you have to copy everything out to the sub hosts, who have to copy everything out to their sub hosts... A scalability nightmare.

You can use redis as a distributed store of truth and... wait, what? Now there's a blog post worth reading. Show us how to scale Ansible with real-world examples using redis and autoscaling groups. Please.

Got it. I can see this work reasonably well if you're willing to wait 10 hours for a deploy to complete. Personally, I'm not.

> If you want to have 1000 forks, that will cost about 30 GB of memory

Ansible is not, nor has it ever been, limited by available memory. It's limited by the number of concurrent SSH sessions it can handle while copying every single module to be executed to that host.

There's plenty of reasons and ways to use Ansible for deploying code. Some of the post has accurate and reasonable information, but the scaling portion is pure fantasy right now.


I had around a thousand hosts and one small virtual machine, and my SSH call to each and every one (running something trivial, like `uptime') took less than a minute in total. Though this was a custom script that used Erlang's built-in SSH client, and it had some unpleasant trouble when hitting the limit on file descriptors (1024, later bumped to 4096 just because of that).


It's less about the SSH overhead than about copying Python scripts to the target hosts and then executing them, one at a time.


I thought Ansible had different execution strategies and you could make the hosts not wait for each other if you wanted. Wouldn't that speed things up a good deal?
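
(That's the "free" strategy, added in Ansible 2.0; a minimal sketch of what it looks like:)

    - hosts: webservers
      strategy: free   # each host proceeds through its tasks independently,
                       # instead of waiting for the slowest host at every step
      tasks:
        - name: Example task
          ping: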


If someone will rewrite this article as Ansible vs. Puppet I will not only buy you a beer but I will throw a parade in your honor.


Puppet has pretty much all of the same problems.

Deploying Ansible:

- SSH

- SSH keys

- Git

- CI

- ... fin?

Deploying Puppet:

- Install the agent on every system (using Ansible, ironically)

- Install a set of masters (need HA)

- Install a message bus for mcollective (if this is still even a thing?)

- Upgrade the agent on ALL of your hosts as and when there's an update or a security patch

- Have an agent/process running on ALL of your hosts, with root access to the box, to execute any command a hacker manages to inject

- ... oh, you've installed Ansible? I'll stop now then.


Nobody is deploying Puppet (or Chef for that matter) using Ansible. You bake it into the images, use bootstrapping or any number of other ways, like cloud-config.

How do you deal with dynamic resources, like automatic scaling, in Ansible? You don't, because it can't. With Puppet and Chef you can actually deal with that.

If you don't have a rigid routine for updating packages on your systems you shouldn't manage systems. Security will always be an issue, thinking that Ansible solves all issues is naive.


"Nobody is deploying Puppet (or Chef for that matter) using Ansible. You bake it into the images, use bootstrapping or any number of other ways, like cloud-config."

How do you initially bake the original image? What installs Puppet or Chef on that original image? Is it Puppet or Chef? Can it bootstrap itself? The answer is no. Something has to create that image and "bake in" the ingredients. Some people use Ansible for this.

"How do you deal with dynamic resources, like automatic scaling, in Ansible? You don't, because it can't. With Puppet and Chef you can actually deal with that."

But it can. You've clearly never tried, but instead made an assumption. I've used both Puppet and Ansible, at large government projects, and the Puppet approach was a complete nightmare.

For your auto-scaling example, I would use Sensu to monitor load and fire off a CI job when that load met some criteria (say 90%+ CPU/network for ten minutes.) That CI job would clone our Ansible repository and call a relevant playbook for spawning more VMs, configuring them accordingly. How many it spawned would just be defined in a YAML file, updated via git if you so chose. It's beautifully simple.

I'll say it again: none of these CM solutions are all-in-one magic bullets - they're platforms or frameworks to build on top of. With Ansible, I would build on top of it with dynamic inventories, custom modules for monitoring and auditing the run time, etc. I do this from my local machine without some monolithic cluster to manage or an agent on all my boxes running as root (or with root privileges - same difference) 24/7.

"If you don't have a rigid routine for updating packages on your systems you shouldn't manage systems."

We create immutable instances which are updated as and when required, launch those new instances, and have our LBs use the new instances, keeping the old ones up whilst connections fade away, then they're destroyed. What's so difficult about that?

What's your point with this remark?

"Security will always be an issue, thinking that Ansible solves all issues is naive."

Er, I never said it did? ... what?

It's security 101, in case you didn't know, to run as little software as possible to reduce the number of potential attack vectors. I can't remember the statistic exactly, but when I did an online cyber security course via MIT, there was a figure along the lines of: for every 1,000 lines of code there's one bug and, potentially, an exploit. The takeaway: run as few services and software packages as possible to help reduce the number of attack vectors.

On that note, are you arguing that running Puppet/Chef in a master/agent configuration does NOT introduce more services, and thus more code, to maintain, secure, patch, make resilient, and keep up? Are you also saying that Ansible's default mode of operation (push and run versus pull and run) over SSH is not more secure? If you believe the latter, then I believe, with all due respect, that it is you who should potentially not be managing systems. (I didn't want to resort to what is clearly a personal attack, but hey, you made this a school playground to begin with ["If you don't have a rigid routine for updating packages on your systems you shouldn't manage systems."] - shrug)

-M


So your argument is that anybody using Chef or Puppet, or just about any master/agent-based CM shouldn't manage systems. I guess a lot of people would be out of jobs then.

There are a number of tools to "bake it" into the original image. Most people don't really "bake it"; they bootstrap it using things like cloud-init. And it can most definitely bootstrap itself into an environment.

While a novel approach to automatic scaling, it seems rather hacky compared to using the tools provided by whatever provider you're using, which are built for this exact thing. I can see a number of things that can, and probably will, go wrong with that setup. Sensu having a hiccup, f.ex.

And your assumptions about me not trying Ansible are incorrect. I've used it, and I like some of the concepts of it. But for anything other than one-off configurations or smaller environments, moving to Puppet was an easy choice. Speed was one of the biggest factors (and Puppet is slow as fuck compared to, say, Chef). How do you do testing in Ansible?

The last remark wasn't an attempt at "school playground", just a factual statement to counter the argument the article raised about old versions of Chef. If you can't upgrade your packages, then why are you even looking at CM?

I never said that Ansible was insecure, or somehow more insecure than Puppet or Chef, I'm not sure where you even think I wrote that.


"So your argument is that anybody using Chef or Puppet, or just about any master/agent-based CM shouldn't manage systems. I guess a lot of people would be out of jobs then."

I said nothing of the sort. Please quote me.

"There are a number of tools to "bake it" into the original image. Most people don't really "bake it", they bootstrap it using things like cloud-init. And it can most definitively bootstrap itself into an environment."

I would be interested to see Puppet/Chef bootstrap itself into an environment. It could be useful for future use cases.

"While a novel approach to automatic scaling it seems rather hacky, instead of using the tools provided by whatever provider you're using, that are built for this exact thing. I can see a number of things that can, and probably will, go wrong with that setup. Sensu having a hiccup, f.ex."

Because using what your cloud provider gives you locks you into that vendor. Amazon and Google both have great offerings when it comes to autoscaling various bits of your estate, but the moment you start using features they offer that no one else does (so virtually all of Amazon's catalog, then), it becomes incredibly difficult to move away from that provider.

"And your assumptions about me not trying Ansible is incorrect. I've used it, and I like some of the concepts of it. But for anything other than one-off configurations or smaller environments, moving to Puppet was an easy choice. Speed being one of the biggest factor, and Puppet is slow as fuck compared to say, Chef. How do you do testing in Ansible?"

Fair enough. Then I withdraw and apologise for my bad assumption.

I've never really found Ansible to be slow, because I manage networks in chunks, correctly partitioned and fenced from each other, which means an update to a product, service, or resource generally touches only a few servers.

If I had an estate with, let's say, 500 web servers on a flat network (the same VLAN), I wouldn't dream of rolling out a change to their configuration in one big swoop. That would be a mental thing to do. Instead I would do it with chunks of servers at a time, something Ansible can do for you. I would also utilise Ansible's failsafe features, so if there are enough errors during a run, it will back off and stop provisioning. I'm sure Puppet/Chef can do this too.
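
(The chunking and back-off described here map to the play-level "serial" and "max_fail_percentage" keywords; a sketch, with the template paths invented:)

    - hosts: webservers
      serial: 25                 # roll through 25 hosts at a time
      max_fail_percentage: 10    # abort the rollout if more than 10% of a batch fails
      tasks:
        - name: Deploy the new config
          template:
            src: app.conf.j2
            dest: /etc/app/app.conf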

"The last remark wasn't an attempt at "school playground", just a factual statement to counter the argument the article raised about old versions of Chef. If you can't upgrade your packages, the why are you even looking at CM?"

This is true to a point, but if you're like me, you're probably bored of the OS by now. I am. I don't want to play around with kernel params, disk issues, packages, etc... I want that abstracted away and made easier.

Also, I've seen people, friends of mine, successfully take someone from zero to hero with regards to Linux, using Ansible as a learning tool: you do something in Ansible, you look at and understand the result. It complements learning nicely.

"I never said that Ansible was insecure, or somehow more insecure than Puppet or Chef, I'm not sure where you even think I wrote that."

You said: "Security will always be an issue, thinking that Ansible solves all issues is naive."

And my response is: security is a complex matter, but one way to reduce that complexity is through the use of as little software as possible. Ansible introduces virtually ZERO new pieces of software on your network (the actual boxes you're provisioning); Puppet and Chef introduce a lot, way too much, across your entire estate, and thus are less secure by definition.


The only way I deploy Puppet today is in masterless mode, with the manifests in a private BitBucket repository. You still have an agent to deploy, but at least it doesn't run all the time, and there's no master and no mcollective to complicate things.


Despite some warts I like Puppet a lot because it's simple while still providing some flexibility to use Ruby. I prefer Ruby over Python by a huge margin and while that would seem to make Chef ideal, at the end of the day Puppet is more approachable for people with zero to minimal experience developing.


I believe this to be the only way to run Puppet, yeah.

I'm glad it's solving problems for you... that is ultimately all I care about at the end of the day.


I'm surprised that Puppet is still moving forward with mcollective. It might have been state of the art a few years back but it just feels unwieldy now. The only time it's relatively easy to deal with is in their enterprise product. Obviously they have to make a living but I'm not sure choosing to make a key component easier in your paid product and leaving it as a pain in the ass for your open source project is the best thing for customers. It feels harder to find community support and discussion because only a small subset use it and/or understand it.


This might be off-topic (or it might be a breath of fresh air if you're tired of configuration managers) --

I've been toying with the idea of making a trolling-but-no-really deployment framework called tarpipe, and all it does is take some files on your host, get em to a $place in one step, and run hook.sh. Oooooptionally, do some dir moves and symlinks to keep a prior state backed up, and service stop|start on either side of the mv/ln, to minimize downtime.

Usage could be `tarpipe ssh user@host` or `tarpipe <(echo "cd keke && bash -c")` just as easily.

It goes without saying that this simply wouldn't be comparable to Ansible, Chef, or other CM because it's too simple. It doesn't help manage state if it escapes $cwd. But if your application can curb its enthusiasm to a directory... boy is it simple if that's true.

I already do this on the daily to crashland my bashrc and dotfiles on any new remote host. Maybe this kind of explicitly zero-dep deploy would be useful for more situations.

Would anyone have a use for that?

---

EDIT: What the heck: I prototyped it: https://gist.github.com/heavenlyhash/b575092aa84ce9f3e1d2


I'm sure it'll work for some people, but configuration management is one of those things where the devil isn't IN the details, the devil IS the details.

What happens if you have big binary files to move around? How do you ensure atomicity? What happens when a run dies midway through? Can you make sure operations are idempotent?

That's even before we get into variables, state, coordination/orchestration, and so on.

I spent a lot of time (a troubling amount, really) in the automation space, and while it's true that simple problems are easy to solve, simple problems quickly become hard problems. Then you're building state machines, binary distribution systems, and now all of a sudden you have an enterprise workflow/configuration management system.

I swear, half the products in this space started out exactly as yours, and the author just kept finding edge cases to fix until someone gave them $10m in series A funding and it was suddenly a business.


Absolutely. But there's a really wildly underserved niche where I have fewer computers than Netflix and I really just wanna push, not have a push framework engine manager tower of turtles.

Sometimes I need kubernetes and orchestration and teleporting state snapshots and oh my.

Sometimes I just need three files on the dang VPS box.

(I see your point, of course. I wouldn't want to build orchestration in bash. Rather, I'd really like to see at least one tool that looks at all this, and goes "huh. We're gonna KISS. And no, it's never going to orchestrate 200 machines. That's okay.")


That's Ansible as far as I can tell. Roles, group_vars, dynamic inventory, etc. are all optional; pushing a few files to some VPSes is just a few lines in a YAML file.
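
(Something like the following would be the whole playbook for that case; file names invented:)

    - hosts: vps
      tasks:
        - name: Push a few files to the box
          copy:
            src: "{{ item }}"
            dest: /srv/app/
          with_items:
            - files/app.conf
            - files/run.sh
            - files/crontab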


And a sprawling python dependency on each side?

You can play the "python's everywhere" card now if you like, but I'll look at you funny.

How many megs, exactly, is the smallest thing you can have ansible push to an empty slave server with no prior contact?


I thought you were talking about simplicity/ease of use, not network bandwidth efficiency.

If you're counting megabytes in a land where low-end mobile phones have 16 gigabytes of storage and 100Mbps wireless, then you probably have a boutique enough use case that you should roll your own.


As the other respondent to this post said - ansible does this use case very well. Maybe the best if you don't already know puppet.


There's absolutely nothing wrong with a bash script or makefile for this purpose. Been doing it for years for deployment on a few small hosts I manage. The last thing I want to deal with is "configuration management" management for these lightweight apps.


> making a trolling-but-no-really deployment framework called tarpipe...

I ended up writing something _very_ like this at a previous job, where we were building out a lot of new systems but the existing Ops department were still living in the Solaris days and wouldn't let anything like Chef/Ansible into their world.

It actually worked really well, you got simple one-command deploys, with rollback and so on, which (seemed to) always work without a hitch.

The whole thing was maybe a few hundred lines of bash. That was a few years ago, and apparently that hacky little tool is still in use for all deployments.


I have been toying with something similar. Basically you write Go code and my prog is similar to "go run" except it cross-compiles it for the target server, ships itself over via ssh/sftp, runs itself, and deletes itself.

I already use this like fabric in some places and have even built in some early conf mgmt with Go templates. Haven't really open sourced it yet except for the scaffolding at https://github.com/cretz/systrument


I mean, this is basically how Ansible works under the hood, only with standalone scripts instead of arbitrary tarballs. You could probably implement it as an action plugin, and then keep using Ansible for all the hard annoying bits- the local execution framework, the ssh wrapper with all the edges sanded down, the 10k handled edge cases around running 'sudo' on arbitrary hosts...


If you're looking for a simple/minimalistic configuration management, I'll shamelessly plug my own: http://holocm.org


https://github.com/pinterest/teletraan ? Uses simple scripts for the "logic", uses a tarball for "distribution".


Some things I really like about Ansible are:

- Super simple declarative yaml configs

- Agentless. You needed to have SSH working anyway, so Ansible just uses that. With SSH pipelining it's so fast (see the config sketch below).

- The community support is huge and extensive.

- They have a module for everything, and development is constant and active, much of it from the community.

- Hardware and networking equipment can be provisioned just the same as a VM or OS image.

The list goes on. Definitely give Ansible a try.
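
(For the pipelining mentioned above, it's a one-line config change; note that it requires "requiretty" to be disabled in sudoers on the targets:)

    # ansible.cfg
    [ssh_connection]
    pipelining = True   # run modules through one SSH session per host
                        # instead of copying each module over first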


I've previously used Puppet in agentless mode and now Ansible. Ansible is much easier to troubleshoot: it just executes things in order until you reach the thing that breaks. Puppet... fails in ways that you sometimes can't diagnose even with debug on.

Not to mention that Puppet totally freaks out if you have the exact same step configured the exact same way in two different modules and refuses to run ('just have package X installed, no further config'). Apparently you're supposed to write a whole new module at that point.

My guess is that Puppet would work better in large, complex environments as a sort of live-managed system, but down at my small scales (tens of servers), Ansible is far easier to develop and manage, I find.


>- Super simple declarative yaml configs

This is the first time I'm looking at Ansible, and I'm instantly taken aback by the if conditions in YAML. They look so strange.
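
(For reference, the conditions in question are "when:" clauses attached to tasks; a trivial example:)

    - name: Install Apache only on Debian-family hosts
      apt:
        name: apache2
        state: present
      when: ansible_os_family == "Debian"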


I guess maybe I'm one of the few: while I've used Ansible, Chef, and now Salt, I'd much prefer something like NixOps + Nix, if it were more mature.


Where is the lack of maturity exactly? Nix itself is pretty stable. I don't know about NixOps, but I suspect you may be talking about Nixpkgs, not NixOps itself?


Someone who is using Nix in production told me NixOps wasn't ready for more complex deployments (the example mentioned was EC2 inside a VPC).


We've been using Ansible to configure baremetal/VM servers as well as to build Docker images. The advantage of using it with Docker is that we get something more powerful and readable than what's possible with a Dockerfile (which tends to quickly turn into spaghetti/unreadable code).

If the application being containerized adheres to some principles, like taking configuration options from environment variables or some other discovery method, then Dockerfiles are simple enough. But if you have to do anything more than a few steps, we've found it more manageable to encapsulate the knowledge in an Ansible role.

We also build our own base images from scratch using Packer (which we love) and Jenkins (not so much love there). That being said, having the required packages for running Ansible inside the image does bloat things a little bit.

If Dockerfiles had more powerful and elegant constructs we could stop using Ansible and remove a layer of abstraction. That would be great.


If you are thinking about working with either of these (or Puppet), please check out Salt Stack; it is a pleasure to work with and equally as powerful.


I have pretty extensive experience using Salt very deeply, and while it's good, I recommend it only with some pretty severe reservations.

First is that management of the master is still a big, big pain point. Unless something's changed in the last few months, you can't meaningfully cluster a salt master. Minion keys must be accepted on each master in turn. Auto-accepting minion keys is risky as hell, too. Master signing keys sound like a great idea until you realize that it means you can't bootstrap a minion sui generis.

My other big problem with Salt is that Salt formulas are really popular but _completely_ uncomposable. Time and time again I found places where the action of a formula needed to depend on data that existed elsewhere in the pillar, but pillars can't be parameterized with pillar data (obviously). They're also not versioned at all, so occasionally breaking changes are merged in (I myself am personally guilty of doing this). The idea is a decent one ("separate the data from the code!"), but they end up being a real headache in a complex environment.

There are a few other, more minor, issues I had with Salt, but really my main beef is that Salt is an extraordinarily effective footgun. If you try to avoid writing "salt scripts" (think: Ansible playbooks), it's very, very easy to end up with a really messy dependency tree, crazy templates, etc. Yes, I know that discipline matters, but people aren't perfect.


I agree with most of your points, but having used other systems, I just have to point out that all configuration management systems are footguns that give you armor over 50% of your foot, and let you aim how you please.

I think Salt helps to enforce that you can't arbitrarily compose disparate datasources with overlapping namespaces and have it magic its way to a sane solution.

Chef tries to compose its data via a very sensible yet hard to get right ordering mechanism. Salt says "nope, you overlap, it breaks, and you get to fix it". I prefer salt's approach but what I like more is the temptation to write an external pillar provider that will do the right thing for me when I get to that point. It's imperfect but quite powerful.


I've used Salt extensively, and found it unbelievably buggy. I'm not sure the company behind it tests it in any way, and you can expect terrible bugs in every point release.

Their bug tracker (which is 3500 open issues deep) has to be treated as documentation, and the real documentation is often a work of fiction.


I had bought into it originally because it claimed some Windows support. It did have it. But it was very limited and bug ridden. I really wanted to like it, but it just fell down at every step. Worked ok-ish on Linux systems though.


That's weird. We have used Salt to manage hundreds of machines for a few years and have had no issues with bugs. However, we are only using it with Ubuntu, so that may be why we haven't encountered the bugs you speak of.

I will note that the zmq layer very occasionally seems to lose connections and you need to ssh into a machine and restart the salt minion; that's the only issue I've seen.


We use salt extensively and I have reservations even though I like it.

It has a fairly nice configuration management language at its core, but it brings so much baggage with it that it's completely overwhelming. They can't figure out what they want to do with the project, so they do a little of everything.

The sheer number of concepts and the awkward terminology you have to learn to use it are off-putting. I've had to drag a couple of devs along for the ride, and the learning curve is a big impediment to further adoption in our org.


But the learning curve is much lower than Puppet, for example, no?


Salt is awesome. Very easy to reason about, very powerful to use, very simple to control.


is it also agent-less?


It can be. Depending on configuration you can run with or without master. (https://docs.saltstack.com/en/latest/topics/tutorials/quicks...)


No; an agent (a "minion") runs on the host and calls back to a master for instruction. Masters pub messages to minions with job IDs, and minions execute and respond accordingly. The minion is very lightweight. Masters less so, but they scale vertically pretty well.


Saltstack can be used without either the salt-minion or salt-master daemon running; it is called masterless mode.


If you'd like something simpler than both, have a look at: https://pressly.github.io/sup


This looks very nice.


I'm having a little trouble taking this article seriously when it's talking about ancient versions of Chef, and that being a problem. If you are still on a pre-0.9 version of Chef, you are doing something very wrong; you already have configuration management, so it's not really an argument not to update to a somewhat stable release. Chef isn't just a pull-based system: you have things like push jobs to go the other way around.

ChefDK is amazing. You get all the tools to do things right. Foodcritic will make you write good code, fast. It is hard-wired for environments. It has a rigid, well-documented precedence for everything. I mean everything.

I work with Ansible, Puppet and Chef, and Ansible is cute, perfect for one-off configurations, but it doesn't do anything chef-solo couldn't do, if you wanted it agentless. Puppet and Chef do more or less the same things; I just find Puppet extremely slow to work with (hiera, r10k, different environments, etc.), hard to debug, and in general just slow.


So, after working with Ansible over the last 7 months, I've come to the conclusion that it's not SO bad (it is still bad) if you have some sort of frontend that builds your playbooks for you and then calls out to `ansible-playbook`, instead of trying to write your playbooks from scratch in YAML.

We have a big CLI application, written in Python, that does a lot of this for us. It provides clean, specific interfaces to our most common tasks and has wrappers around the base tools (`ansible` and `ansible-playbook`) that add some useful features we found missing from the base tooling (want to set a nested variable without resorting to passing a JSON dictionary at the command line? `-V foo.bar.baz value`).

Even though our CLI is written in Python, we still prefer to build and write playbooks to temp files before calling out to `ansible-playbook`, rather than use the Python API, just due to the amount of logic baked into the module and playbook runners that we'd have to replicate ourselves. Ansible 2.0 looks to provide a more powerful Python API (and a better example of how to use it in the docs), but porting our stuff isn't really a priority right now.

As I started writing this, I thought it would be a much more scathing comment, as I find myself terribly frustrated on a regular basis, but, as I said, it's not SO bad using a frontend written in a language I enjoy working with. As another commenter wrote, I really feel that Lisps/Schemes (or some other language with a powerful macro system) would make the ultimate ops/CM languages... Guix and Pallet seem to be the only games in town in that regard.


I have begun work on a remote deployment system for GNU Guix, which will be our Ansible/Chef/etc. equivalent. We already have the tools for performing full-system configuration management on the local machine and offloading builds from one machine to another, so it's a matter of gluing the two together. We'll offload full-system builds to a cluster of machines and then instantiate the configuration(s).


From what I know (though this is probably outdated knowledge): Ansible is what you'd use if you already use shell scripts + SSH. You upgrade to Ansible, and instead of shell scripts you get nicer things.

However if you have something complicated that needs templating, 1000s of servers, secrets/passwords to distribute, then something like Salt or Chef is the way to go.

Personally I want to see where NixOS/Disnix/Guix will end up. On paper those seem very powerful and I like the idea behind them.


ansible ftw. i really regret having started with chef a few years back.


This is pure bunk. Let's go down the list, one by one:

Maintenance: Sure, Chef has a server component. So does Ansible, if you're using it the way he suggests (with a host periodically running Ansible playbooks on all hosts). Ansible has no client component to upgrade, though, so that's a win, right? It totally is, until Ansible doesn't work on a host and you can't figure out why, and the error logs you get are useless because some of Ansible's many assumptions about the host's initial state are incorrect. Chef can be managed by a standard package manager, which costs nothing on the client side and allows far, far better assumptions to be made.

For the record, I eventually gave up on the Chef server, replicated the cookbooks to each machine (using a cronjob and git), and used chef-solo.

Speed: Ansible pipelining speeds it up significantly. You can almost get one command a second! Chef runs on the host. It is Ruby, and goes slow, but I have programmed a lot of Chef and run a lot of Ansible, and my average Chef run was under 30 seconds, while I've yet to have Ansible run any playbook in under a minute. Some of this is from atrocious default behavior, like requiring all hosts to complete a step before moving on to the next step on any host, or the fact that it spends nearly 10 seconds of CPU time on each machine 'gathering facts' at the beginning of its playbooks, even if none of those facts are ever used.
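
(For what it's worth, the fact-gathering cost is avoidable per play when no facts are used; a sketch:)

    - hosts: all
      gather_facts: no    # skip the ~10s "setup" step entirely
      tasks:
        - name: Something that needs no facts
          command: /usr/bin/uptime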

Fact caching: This is a solution to the aforementioned problems with Ansible. It may make sense in the chef-server context, but I don't have a whole lot of experience with it.

Tags: This is probably a matter of personal preference, but I prefer to give the set of things that need to be done, and have the tree descend downwards based on dependencies from there. The author clearly prefers to specify with tasks when they should happen, and for each host a set of initial circumstances. This one can be argued til you're blue in the face. I make the point that there's a clear tree that can be built of dependencies under my scheme.

Push vs Pull: There's no maintenance cost to upgrading? What is this dude on? When you change Ansible revisions, you have to do just as much work adapting as from one Chef revision to the next. Ansible has always been highly in flux, and not great about preserving default behavior.

Pulls still have to be triggered, but they can (and should) be triggered on-host, in a cronjob. Your monitoring system should alert you when the chef run is out of date, though, honestly, if it is failing on just some of your hosts you need to clean up and unify your infrastructure.

Raw numbers: Ansible scales by throwing one large machine at the problem. Chef costs you a tiny amount on each machine. One of these scales; one of these does not.

Search and inventory: Oh gods, if you're using Ansible for inventory management, please don't. If you're using Chef for inventory management, please don't. Neither is a reasonable tool for the job.

Orchestration: Neither Chef nor Ansible is an appropriate tool for dealing with your application's data model. Full stop. Actually, full stop. There's nothing else of value further down this article. Please don't take any of its advice.


What are the cons of, instead of using Ansible (or the others), just doing a simple clone of a virtual disk image from some configured master? There would still be a need to configure some things (IP addresses, ...), but a lot can be shared. One can also use a layered filesystem to keep the shared things in a read-only layer and clone-specific changes in a read-write layer (basically what Docker does). I know Ansible has somehow solved the problem of running the same config scripts multiple times, but is it 100% bulletproof?


All of the responses to this post so far: "This sieve is useless! It won't hold my coffee and I get burnt every time I use it!!"


"but I prefer using git submodules" - nope.

Great tutorial!


It's weird the author uses git submodules. Ansible has a requirements file you can use for dependency management.

http://docs.ansible.com/ansible/galaxy.html#advanced-control...


It may be just me, but I rather like http://kubernetes.io/ as it takes the approach that you are deploying an application, not a set of disparate services.


These aren't exactly comparable with Kubernetes. While there's overlap between the ultimate goals, we're fundamentally talking about two very different things.

   Kubernetes, Mesos, Swarm, Fleet, Marathon belong in one category (orchestration tools).

   Ansible, Chef, Puppet, Salt belong in another category (configuration management).


Ansible at its core is an orchestration tool that grew configuration management. It's not really the same beasty as Chef/Puppet/Salt. Salt is also pretty strong here, while Chef is basically worthless for anything outside of config management.

And maybe it's because we don't have good terminology around this, but your 'orchestration tools' include both backend and frontend tools for cluster management- Marathon sits at the bottom of Mesos, etc- it just seems like a weird group.


What does Google use for configuration management? And how do I do configuration management with Nix?


http://flabbergast.org/ is the closest, and in a similar vein you'd have https://github.com/google/jsonnet and then Nix.

Configuration management is a very different thing from what most of ansible/chef/puppet/salt are commonly used for; it's possible to do it, but it's a bit of an uphill battle. The core problem of config management is not pushing things out, but how you represent configs, with their myriad of interacting multi-level exceptions, while also allowing tens/hundreds/thousands of operational changes to be ongoing at the same time.

Ansible, for example, is good for about 3-4 levels of exceptions, which isn't much considering region and environment (dev vs prod) are two of those.


Flabbergast seems to have the right idea, but its lack of examples is a problem. Its only example is based around Apache Aurora (who in hell uses that, and what it is I don't know; I don't really want to learn it just to understand an example). This hasn't changed since it was first announced, so I don't think I'll take another look for quite a while unless I come across some reasonable howto.


This is very useful, thanks.




