Right now we run everything in the US-East region, with all our services balanced across 4 availability zones. A problem in a single AZ will affect every layer of our system, but only about 25% of the hosts in each layer. Some of our services are automatically resilient and can handle that easily. Others handle it less gracefully, but we're working on more automatic failover.
When we need more servers for an auto-scaled service, we open spot requests and start on-demand instances at the same time. For most services, we want to run about 50% on-demand and 50% spot. We have a watchdog process that continually checks what's running. It launches more instances whenever there aren't enough, and terminates instances when there are too many. So if the spot price spikes and a bunch of our spot instances are shut down, the watchdog will launch replacement instances on-demand. It will also request more spot instances once the price has dropped back to normal. In reality we don't often run into spot capacity issues -- maybe once a month, and it's almost never apparent to our users.
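The watchdog's decision step can be sketched as a pure function: given the target fleet size and what's currently running, decide how many spot and on-demand instances to launch (or how many to terminate). This is a minimal illustration, not our actual code -- the function name, parameters, and the `spot_available` flag are hypothetical, and the real process would act on these numbers via the EC2 API.

```python
# Hypothetical sketch of the watchdog's reconciliation logic.
# The real watchdog would feed these results into EC2 launch/terminate calls.

def reconcile(target_total, running_spot, running_on_demand,
              spot_available=True, spot_ratio=0.5):
    """Return (launch_spot, launch_on_demand, terminate_count).

    target_total:   desired instance count for the service
    spot_available: False while the spot price is spiking
    spot_ratio:     desired fraction of the fleet on spot (~50% for us)
    """
    running = running_spot + running_on_demand
    if running > target_total:
        # Too many instances: terminate the surplus.
        return 0, 0, running - target_total
    shortfall = target_total - running
    if shortfall == 0:
        return 0, 0, 0
    if not spot_available:
        # Spot price spiked: backfill the gap entirely with on-demand.
        return 0, shortfall, 0
    # Launch spot up to the desired ratio, on-demand for the rest.
    desired_spot = round(target_total * spot_ratio)
    launch_spot = min(shortfall, max(0, desired_spot - running_spot))
    return launch_spot, shortfall - launch_spot, 0
```

Run repeatedly, this converges back to the target mix: when the spot market recovers (`spot_available` flips back to True), later passes request spot capacity again until the fleet is roughly half and half.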
I spoke about this in detail at AWS re:Invent last month, and the full talk is available online here: http://www.youtube.com/watch?v=73-G2zQ9sHU