Etcd v3: increased scale and new APIs (coreos.com)
189 points by philips on June 30, 2016 | 53 comments


Sounds interesting! What are the benefits of using Etcd over Consul? (https://www.consul.io/)


I think the two projects focus on different things.

Consul provides features like health checking and failure detection in addition to its consistent key-value store. It aims to be an all-in-one solution[0].

etcd focuses on the consistent key-value store. Its key-value store has more advanced features, such as multi-version keys and reliable watches, and provides better performance. People build additional features on top of etcd's key-value store and Raft implementation.
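
To give a flavor of the v3 API, here is a minimal sketch using the Go clientv3 package; the endpoint and key are placeholders:

    package main

    import (
        "fmt"
        "time"

        "github.com/coreos/etcd/clientv3"
        "golang.org/x/net/context"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"localhost:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        // Every write bumps a cluster-wide revision, so keys are
        // multi-version and reads/watches can be pinned to a revision.
        resp, err := cli.Put(context.Background(), "/config/feature-x", "on")
        if err != nil {
            panic(err)
        }

        // A reliable watch: resume from the revision of the write
        // above, so no intervening events are missed.
        wch := cli.Watch(context.Background(), "/config/feature-x",
            clientv3.WithRev(resp.Header.Revision))
        for wresp := range wch {
            for _, ev := range wresp.Events {
                fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
            }
        }
    }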

(I work on etcd)

[0] https://www.consul.io/intro/vs/index.html


Do you have proof of the "better performance" claim?


Consul: https://github.com/hashicorp/consul/blob/master/bench/result...

etcd: https://github.com/coreos/etcd/blob/master/Documentation/op-...

Note that the testing environments in the docs I listed are not exactly the same, but they are comparable. Also, Consul's performance has improved over the last few releases.

So we ran the benchmark internally, with both systems in the same environment. The results are still comparable to what I listed in the two official docs.

The best way to compare performance is still probably to run the benchmark in your own environment.


I've used Consul, but I haven't used Etcd directly, only via Kubernetes. One benefit IMO is the number of high-profile eyeballs on the project, due to the success of K8s and the pedigree of the engineers working on it.


k8s works with etcd only. Consul is better in terms of features.


Does anyone have benchmarks or comparisons with ZK? We have run it for many years without a problem and are very happy with it. I would be interested in hearing from anyone who switched from ZK to etcd for distributed locking, presence, or leader-election-type idioms, ran it in prod for a few months or years, and can share their takeaways.


etcd3 has performance similar to ZK's at small scale. For large datasets, etcd3 does better, since it takes incremental snapshots and has a smaller memory footprint, whereas ZK takes full snapshots, which consume a lot of resources. For watches, etcd3 uses streaming plus TCP multiplexing to save memory.
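
Roughly, the multiplexing means many watches from a single client share one gRPC HTTP/2 connection, instead of one TCP connection per watch. A minimal sketch (prefixes and endpoint are placeholders; imports elided):

    // import "github.com/coreos/etcd/clientv3"
    cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx := context.Background()
    // Both watch streams ride on the same underlying connection.
    services := cli.Watch(ctx, "/services/", clientv3.WithPrefix())
    config := cli.Watch(ctx, "/config/", clientv3.WithPrefix())

    for {
        select {
        case resp := <-services:
            for _, ev := range resp.Events {
                fmt.Printf("service event: %s %q\n", ev.Type, ev.Kv.Key)
            }
        case resp := <-config:
            for _, ev := range resp.Events {
                fmt.Printf("config event: %s %q\n", ev.Type, ev.Kv.Key)
            }
        }
    }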


I too have been using ZK for many years now, and it's pretty great.

Etcd can provide faster election notifications, but that comes at the cost of etcd still being pretty new, so be prepared to get cut by the bleeding edge :)


etcd is a key value store that has features of ZK. ZooKeeper is strictly distributed coordination. Apples and oranges, no?


Apples all the way. Etcd is pretty much a clone of ZooKeeper in Go; they both support hierarchical keys, atomic autoincrement, and watches, though Etcd uses the Raft consensus algorithm, whereas ZooKeeper uses its own homegrown algorithm, and there are other minor differences. Both are intended for configuration management and coordination. (Of course, both are ultimately clones of Google's internal tool, Chubby.)


I agree that they are similar in some ways, but under the covers they are fundamentally different beasts in almost all ways!

Etcd uses Raft, and ZooKeeper uses its own protocol called Zab [0]. Zab shares some characteristics with Paxos but certainly IS NOT Paxos.

[0] https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs...


As I understand it, both Paxos and Zab will result in similar performance characteristics, since writes need, by design, to be coordinated with peers and serialized in a strict manner. In this sense, Etcd and ZK are very much alike, irrespective of how they are implemented internally. I wouldn't be surprised if Etcd was found to be faster and more scalable than ZK, however.


It really depends on how you view the problem. Yes, the latency of agreeing on a proposal is similar, since it is limited by physical factors (network latency + disk I/O). However, there are ways to put more entries into one proposal (batching) and to submit proposals continuously (pipelining). These optimizations depend heavily on the implementation.
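
For what it's worth, etcd's raft library exposes knobs for exactly these two optimizations. A minimal sketch (the numbers are arbitrary examples, not recommendations):

    // import "github.com/coreos/etcd/raft"
    cfg := &raft.Config{
        ID:            0x01,
        ElectionTick:  10,
        HeartbeatTick: 1,
        Storage:       raft.NewMemoryStorage(),

        // Batching: upper bound on the byte size of entries packed
        // into a single append message.
        MaxSizePerMsg: 1024 * 1024,

        // Pipelining: how many append messages may be in flight to a
        // follower before waiting for acknowledgements.
        MaxInflightMsgs: 256,
    }
    _ = cfg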


I believe ZK implements Paxos. Maybe someone not on mobile can correct me / provide a reference.


ZK uses Zab, which is similar to Paxos but not Paxos.


Your point is well taken. This question comes up often enough for similar use cases, hence my asking whether someone has data to support an argument. Considering how critical ZK is in the larger stack, and our success and operational expertise with it, it would take significant convincing to switch to something like etcd.

The following two paragraphs from the etcd project site seem to hint that they're trying to target overlapping use cases: "Your applications can read and write data into etcd. A simple use-case is to store database connection details or feature flags in etcd as key value pairs. These values can be watched, allowing your app to reconfigure itself when they change.

Advanced uses take advantage of the consistency guarantees to implement database leader elections or do distributed locking across a cluster of workers."
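
For reference, that second paragraph corresponds to the recipes in etcd's clientv3 concurrency package. A rough sketch of leader election with the Go client (endpoint, election prefix, and candidate value are placeholders):

    // import "github.com/coreos/etcd/clientv3/concurrency"
    cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    // A session keeps a lease alive; if this process dies, the lease
    // expires and leadership is released automatically.
    sess, err := concurrency.NewSession(cli)
    if err != nil {
        panic(err)
    }
    defer sess.Close()

    e := concurrency.NewElection(sess, "/my-service/leader")
    // Campaign blocks until this candidate becomes the leader.
    if err := e.Campaign(context.Background(), "worker-1"); err != nil {
        panic(err)
    }
    // ... do leader-only work, then resign when finished.
    _ = e.Resign(context.Background())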


It's not mentioned in the blog post—is there a document that explains the migration plan for the etcd that ships on the host in CoreOS?

Edit: They haven't been published to the docs on the CoreOS website yet, but there are two documents listed under "upgrading and compatibility" at the bottom of https://github.com/coreos/etcd/blob/v3.0.0/Documentation/doc...


Great question. The upgrade path is a rolling upgrade from the v2.3.y series to the v3.0.0 series. This is how all etcd upgrades have worked since the start of the v2.x.y series.

Doc is here: https://github.com/coreos/etcd/blob/master/Documentation/upg...


Seems worth noting as well that this only upgrades the version of the cluster. Data populated via the v2 API will not magically be available via the v3 API, as the two APIs have separate data stores/keyspaces. https://github.com/coreos/etcd/blob/v3.0.0/Documentation/op-... talks about how to migrate data that was stored with v2 to v3's data store.
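
For illustration only, a rough sketch of what manually copying a single key across the two keyspaces could look like with the Go clients (endpoints and key are placeholders; the linked doc describes the supported migration path):

    // import (
    //     etcdv2 "github.com/coreos/etcd/client"
    //     "github.com/coreos/etcd/clientv3"
    //     "golang.org/x/net/context"
    // )
    func copyKey(key string) error {
        ctx := context.Background()

        // Read the value through the v2 API.
        c2, err := etcdv2.New(etcdv2.Config{Endpoints: []string{"http://localhost:2379"}})
        if err != nil {
            return err
        }
        resp, err := etcdv2.NewKeysAPI(c2).Get(ctx, key, nil)
        if err != nil {
            return err
        }

        // Write it into the separate v3 keyspace.
        c3, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
        if err != nil {
            return err
        }
        defer c3.Close()
        _, err = c3.Put(ctx, key, resp.Node.Value)
        return err
    }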


Etcd looks more and more promising as its usage and development activity increase. Is anyone using it internally as a standalone part of a system (i.e. not just for k8s or CoreOS)?

The use of gRPC shows great promise, but systems like ZooKeeper still play nicer in more traditional Java shops, or do they? How hard is it to use etcd from the JVM?


There's a project called zetcd[0] that acts as a translation proxy in front of etcd to let you use the ZooKeeper API. I don't know if it's production ready, but I do know it works and is a pretty cool idea.

[0]: https://github.com/chzchzchz/zetcd


It should not be hard. For the v2 API, there are several Java bindings (https://github.com/jurmous/etcd4j).

For the gRPC API, it is easy to generate a Java gRPC client based on the defined service. We have plans to make that experience better.


Etcd looks more promising?

ZooKeeper has been around for over 5 years now, with an extremely large install base.

Kafka, Hadoop, Solr, Mesos, and HBase are all projects that leverage ZooKeeper for distributed coordination.

ZooKeeper has already delivered.


I would tend to agree. Knowing what ZooKeeper has been doing, and having actually used ZooKeeper (and etcd), I can say that the API and the primitives offered by ZooKeeper are IMHO better (although the multi-version concurrency control model is interesting) and more mature.

It feels like etcd is 'still discovering itself', for lack of a better phrase.

Btw: https://issues.apache.org/jira/browse/ZOOKEEPER-2169 (this is the equivalent of TTLs for zookeeper).


Seems pretty easy to use etcd from a JVM, particularly now that it is gRPC based.


I'm not sure if I can disclose our product details, but we're using etcd2 for consensus in a data storage cluster.


Are there high level APIs for etcd like there are for ZooKeeper? I'm the main author of Apache Curator and I know that writing "recipes" is not trivial.


Yes. Here: https://godoc.org/github.com/coreos/etcd/clientv3/concurrenc...

It would be great if you could provide opinions, comments, or help on these high-level APIs. We may also move these into an internal proxy layer, so that clients in other languages can use them more easily.

(some more here: https://github.com/coreos/etcd/tree/master/contrib/recipes)
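
For example, a minimal sketch of the Mutex recipe (the lock prefix is a placeholder, and cli is an existing clientv3 client):

    // import "github.com/coreos/etcd/clientv3/concurrency"
    sess, err := concurrency.NewSession(cli)
    if err != nil {
        panic(err)
    }
    defer sess.Close()

    m := concurrency.NewMutex(sess, "/locks/job-queue")
    if err := m.Lock(context.Background()); err != nil {
        panic(err)
    }
    // ... critical section: only one client holds the lock at a time.
    if err := m.Unlock(context.Background()); err != nil {
        panic(err)
    }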


I assume there's a Java API for etcd? If so, it would be interesting to try to port Curator.


We are working on it (https://github.com/coreos/etcd/issues/5067). Perhaps we could work together on the Java client first? It should not be hard, given that gRPC supports Java.


Do they publish formal specifications of these distributed algorithms?

Has anyone verified the implementation?

Zookeeper is at least based on Paxos, which has a TLA+ model one can check.


etcd is based on Raft, and Raft has a TLA+ spec. But do note that an implementation usually diverges from its algorithm [0].

For etcd, we try to keep the core algorithm as self-contained and deterministic (no I/O, no timers) as possible, so it stays very close to the pure algorithm. We are very confident in it since we have thoroughly tested it, and the implementation is shared with other consistent large-scale database systems too (CockroachDB, TiKV).
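
Concretely, the application drives the raft node: ticks and all I/O happen outside the core state machine. A minimal sketch of that loop, assuming node and storage were created via raft.StartNode and raft.NewMemoryStorage (HardState/snapshot handling elided; send and apply are hypothetical helpers):

    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            node.Tick() // the caller owns the clock
        case rd := <-node.Ready():
            storage.Append(rd.Entries) // the caller owns persistence
            send(rd.Messages)          // hypothetical transport helper
            apply(rd.CommittedEntries) // hypothetical apply helper
            node.Advance()
        }
    }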

ZooKeeper uses Zab [1] under the hood. I do not think there is a TLA+ spec for Zab.

[0] https://www.cs.utexas.edu/users/lorenzo/corsi/cs380d/papers/...

[1] http://www-users.cselabs.umn.edu/classes/Spring-2016/csci821...


Do you (or people you know) plan on publishing peer-reviewed papers on how the lease algorithm works, how the multi-version concurrency control model works, and so on?

None of these are, AFAIK, built into Raft; they are add-ons that etcd has created (and they may or may not even use Raft).

It would be great to instill confidence in these add-on features by having them peer-reviewed, in the same manner as Raft has been.


Raft - cool!

I'm aware that implementations diverge from, or are not directly synthesized from, their specifications. At least not yet.

I'll be poking around more to see if the etcd team has published their own specs for the novel parts of the system. I'm particularly interested in seeing if/how OSS projects are adopting more rigorous "engineering" practices.


I took a look at the etcd v3 raft.go code; it is indeed nicely written! Definitely better than any other one I've seen.

(Though I'm pretty excited about the Raft implementation my summer intern and I are currently hacking on :) )


In Haskell, I presume? Will it be open sourced? :)


Actually doing it in Agda first. Coinductive programming is a delight with Agda 2.5.

But yes that's the plan.


Consul's Raft implementation is leagues better than etcd's, unfortunately.


I'd be very interested in hearing more about this.

I tried using both the Consul and etcd Raft implementations as libraries. I found the Consul library much easier to interface with, but my impression was that the etcd library was much more battle-tested in the real world, through big projects like Kubernetes and through being embedded in projects like CockroachDB. I also wasn't sure whether the details the Consul implementation was hiding were actually important.


Is that still the case with Etcd v3? I remember reading somewhere that they were rewriting the Raft implementation in Etcd a good while ago.



We also do a ton of functional testing as code is merged to test lots of different partitions and faults. You can read more on this post: https://coreos.com/blog/new-functional-testing-in-etcd/


Zookeeper does not use Paxos. It uses its own protocol, Zab.

https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs...


This doesn't look like feature creep, but CoreOS bought into systemd big time, and with etcd being used in more places, the temptation of feature creep grows... and that's worrying.


Will CORS ever be enabled for the https://discovery.etcd.io/new endpoint?


This is the GitHub issue for that service. Please +1 the thing on GitHub and we will try to get it fixed in prod: https://github.com/coreos/discovery.etcd.io/issues/12


Wait... there are maintainers that encourage +1s all over their GitHub issues?


Probably referring to the new "reactions" feature. https://github.com/blog/2119-add-reactions-to-pull-requests-...


It's now possible to +1 comments without adding a new comment (and notifying maintainers).


Last time I looked into Etcd, I got the impression that it was great at handling a high volume of read operations (theoretically read-scalable) but bad at handling a high volume of write operations (since every write has to be propagated to every node in the cluster via Raft). Is this still the case in v3?


[flagged]


We detached this subthread from https://news.ycombinator.com/item?id=12011247 and marked it off-topic.


Cracks you up?

This is a feature that has been missing since etcd's introduction, and this article on a major new version makes no mention of it.

Are we really so speech-restricted here that we can only discuss a press release in the company's own terms?

The question is highly relevant to the etcd product and its future.



