
Yes! LIFO is a fantastic improvement. This suggestion is buried in the article a bit. Maybe I should have elevated it, or maybe broken it into more pieces. There’s so much to talk about on this topic. But yeah, LIFO is totally “This one weird trick that will make your service bulletproof to overload! Chaos Monkeys hate it!”


I'm really confused about how and why this works. Why is it a good idea to keep really old requests unhandled? I feel like I must be missing something obvious.


A queue distributes the latency increase to all requests whereas a stack only increases the latency for some requests when you're overloaded.

This means that if you catch up to the incoming new requests, a queue keeps every request running slower due to the time spent in the queue.

A stack in steady state, on the other hand, gives the same few items worse and worse latency, while all the new items go back to normal.

Chances are the long-latency requests will be retried, and there's a roughly fixed number of items to retry, so you don't have to worry too much about the ones stuck in the stack. For a queue, the retries lengthen the queue, and that waiting time is added to every item, making them more likely to retry too, lengthening the queue further, and so on.

Having the stack leak items at the bottom gets you the same benefit - you don't have to fulfil them. But if you can get items back out of the stack quickly, it's still worth working on and completing them before the client needs to send a retry. The more of them you complete, the more your high percentiles (the 9s) look like your p50 instead of your p100.
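To make this concrete, here's a toy discrete-time simulation (everything here - the arrival rates, the capacity, the tick model - is made up purely for illustration, not taken from any real system):

```python
from collections import deque

def simulate(policy, arrivals, capacity, ticks):
    """Toy discrete-time model: each tick, arrivals(t) requests arrive and
    the server completes up to `capacity` of them. Returns the sorted wait
    times (in ticks) of every request that got served."""
    buf = deque()
    waits = []
    for t in range(ticks):
        buf.extend([t] * arrivals(t))  # record each request's arrival tick
        for _ in range(capacity):
            if not buf:
                break
            # FIFO takes the oldest request; LIFO takes the newest.
            arrived = buf.popleft() if policy == "fifo" else buf.pop()
            waits.append(t - arrived)
    return sorted(waits)

# Overload burst: 5 arrivals/tick for 20 ticks, then 2/tick, vs capacity 3.
arr = lambda t: 5 if t < 20 else 2
fifo = simulate("fifo", arr, 3, 200)
lifo = simulate("lifo", arr, 3, 200)

pct = lambda w, p: w[int(p * (len(w) - 1))]
print("p90 wait  fifo:", pct(fifo, 0.9), " lifo:", pct(lifo, 0.9))
print("max wait  fifo:", max(fifo), " lifo:", max(lifo))
```

One neat property of this model: the *total* waiting time is identical under both policies (any policy that never idles serves the same number of requests per tick, so the sums match). The difference is distribution - in this run FIFO spreads the wait across most of the burst-era requests, while LIFO concentrates all of it on the few buried at the bottom, keeping p90 at zero at the cost of a worse p100.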


Consider the two steady states:

You are keeping up and the queue is mostly empty: the order does not matter.

You are not keeping up and the queue is growing. If nothing changes, the age of items removed from the front will grow and grow, and eventually all of them will be timeouts or abandoned.


Oh, I think I see. With a queue the failure state is that everything times out. With a stack you risk sacrificing only the oldest requests and keep the newest alive.


And you end up not servicing some random sample that get buried in the stack before you can get to them.


It guarantees that in an overload situation, the requests that do get handled are handled quickly. In the same situation a FIFO would grow until all requests are really slow or time out, without increasing throughput. The reason to keep old requests in the LIFO instead of dropping them right away is that they can still be served when load drops, just in case there's someone waiting for the page to load.


I suppose the older requests are less likely to have actual people doing a page reload/retry that stacks up yet more demand.


These remain great techniques! Even iptables like you mention - it's extremely good at cheaply shedding new handshakes, versus later on in processing the request. You lose a little visibility, but it's a powerful outer "layer of the onion".

And good callout on middleboxes, even high-level abstraction ones like Amazon API Gateway. In fact this is my favorite feature of it. API Gateway can reject a very high rate of excess traffic for a small overloaded service behind it.


Agreed - utilization is an important consideration here. The capacity red line will still be there, but when load shedding is effective, the impact of crossing that red line is lessened. It'd be an error rate linearly proportional to the excess, rather than the service falling off a cliff. But for the services this article is talking about, neither case is okay, so we put a ton of emphasis on auto scaling models to make sure we don't get into that situation.

A key sort of “continuation” to this article is the one on fairness: https://aws.amazon.com/builders-library/fairness-in-multi-te... . This gets into the topic of utilization a bit more.

But you’re right - good load shedding gives a business a tool to make an easier trade off when it comes to capacity management. A slight error rate until autoscaling kicks in is an easier pill to swallow than a worse outage.


It’s a tricky topic because load shedding is a last resort that kicks in when there’s already a problem. So until auto scaling catches up or the issue is mitigated some other way, we try to make as many customers happy as possible, rather than making everyone equally unhappy.


This is a great way to describe it! I gave a similar example of pagination and how the later pages might be better to prioritize over initial pagination requests, but your example is a nicer illustration. Thanks for that!

There’s also someone I was talking to after writing the article who said they can fall back to statically rendered versions of certain pages on Amazon.com during overload. The trick is to have a page that is still useful!

And for the "turning off features" idea - this happens today on Amazon.com. If a feature on the site fails to render successfully or on time, it's left off of the page. Critical functionality can't be left off, so it's a judgement call on what's allowed to fail the page render.
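A minimal sketch of that "drop the failed widget" idea - the widget names, render times, and per-widget deadline here are all invented for illustration, not how Amazon.com actually does it:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical widgets: each renders an HTML fragment; only some are critical.
def buy_box():
    return "<div>buy</div>"

def recommendations():
    time.sleep(0.5)  # simulate a slow dependency during overload
    return "<div>recs</div>"

# (name, render function, is_critical)
WIDGETS = [("buy_box", buy_box, True), ("recommendations", recommendations, False)]

def render_page(deadline=0.1):
    """Render all widgets concurrently; a non-critical widget that misses
    the deadline (or errors) is simply left off the page, while a critical
    one failing fails the whole page render."""
    parts = []
    with ThreadPoolExecutor() as pool:
        futures = [(name, pool.submit(fn), critical) for name, fn, critical in WIDGETS]
        for name, fut, critical in futures:
            try:
                parts.append(fut.result(timeout=deadline))
            except Exception:
                if critical:
                    raise  # critical widget failed: fail the page
                # non-critical widget: drop it and keep going
    return "".join(parts)

print(render_page())  # -> "<div>buy</div>" (recommendations dropped)
```

With a generous deadline both fragments come back; under pressure, the page still ships with just the critical pieces. A real system would use one deadline for the whole page rather than per widget, but the shape of the decision is the same.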


Ah, yes, you're right...I missed the pagination example fitting that pattern.

"If a feature on the site fails to render successfully or on time, it’s left off of the page. Critical functionality can be left off, so it’s a judgement call on what’s allowed to fail the page render."

Oh, that's useful also, but I meant a step further, where the page doesn't ask for those widgets if (load > X). Which avoids making the call at all.


Good point around avoiding the call in the first place. This is a very tricky topic, I’ve found. Things that try to guess the nuanced health of a dependency can lead to outages when they guess wrong. These circuit breakers are helpful if they’re right, but harmful if they’re wrong.

For example, say a service is backed by a partitioned cache cluster, where each piece of data is hashed to a particular cache node. Now let's say one node has a problem, causing requests for data that lives on that node to fail while others succeed. If a client is making requests for data spread across all the nodes (the client doesn't know about these nodes, by the way; they're just an implementation detail of the service) and sees an increased error rate, should it start failing some requests? It could take a single-partition outage and widen the scope of impact into a full outage.
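A toy model of that failure mode - the partition count, error threshold, and window size are all made up, and the "breaker" is a deliberately naive sketch rather than any real library:

```python
PARTITIONS = 8
BAD_PARTITION = 3  # hypothetical: one unhealthy cache node out of eight

def backend_call(key):
    """The service routes each key to a cache partition (modelled here as a
    simple modulo); only requests landing on the bad partition fail."""
    if key % PARTITIONS == BAD_PARTITION:
        raise RuntimeError("partition down")
    return f"value-{key}"

class NaiveBreaker:
    """Client-side circuit breaker that opens once the error rate over a
    window crosses a threshold. It has no idea the failures are confined
    to a single partition."""
    def __init__(self, threshold=0.10, window=100):
        self.threshold, self.window, self.results = threshold, window, []

    def call(self, fn, key):
        recent = self.results[-self.window:]
        if len(recent) == self.window and sum(recent) / self.window > self.threshold:
            raise RuntimeError("circuit open")  # now EVERY key fails
        try:
            value = fn(key)
            self.results.append(0)
            return value
        except RuntimeError:
            self.results.append(1)
            raise

breaker = NaiveBreaker()
outcomes = {"ok": 0, "partition down": 0, "circuit open": 0}
for key in range(1000):
    try:
        breaker.call(backend_call, key)
        outcomes["ok"] += 1
    except RuntimeError as e:
        outcomes[str(e)] += 1

# A ~12.5% partial failure becomes a 100% outage once the breaker opens:
print(outcomes)  # -> {'ok': 87, 'partition down': 13, 'circuit open': 900}
```

The 1-in-8 failure rate clears the 10% threshold, the breaker opens, and every request after that fails - including the 7 in 8 that would have succeeded.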

Anyway I’ve been meaning to write an Amazon Builders’ Library article on this topic, or to convince someone else to do it (looking at you, Marc Brooker!)


Good point. We should use more images. I'll think about where we can add some for clarity! Now that we've done a bunch of talks at re:Invent on these topics, we should have some that we can incorporate easily.


One of the authors here. I agree with all the folks' suggestions here! When I was in college, I took extra writing courses beyond what was required for an engineering degree (they counted toward general humanities credits, I think). People always told me how important communication and writing were in engineering. Not sure the courses are what did it for me entirely, but they certainly taught me concise "technical communication", as well as storytelling, and forced me to practice! A huge key to good, clear writing is reviewers. I ended up essentially rewriting some of the articles multiple times based on feedback. And we all had some great editorial help to give the articles that final polish.

