It's so funny to me that every time this happens it turns out Amazon hosts like 80% of its own services on us-east-1 and it's just us-east-1 that's down. Why they have not diversified their own infrastructure to even just us-east-2 is beyond me.
It's probably diversified, there's just a single point of failure that's difficult to get rid of, probably a network load balancer or something.
But like, why not move that outside of us-east-1? That's literally always the one that goes down. Just move that single point of failure outside of the most used region by a large margin on AWS. Seems like a basic reliability engineering practice.
Each site has its own network load balancer; if it were an AWS-wide one, the outage probably wouldn't be limited to a single region. And again, this is just my guess. A hardware load balancer is how the place I worked at managed traffic between several server rooms, and it was usually the main culprit for downtime. It's a very handy device for keeping your network from getting overwhelmed by traffic spikes, but as far as I know it's really hard to make redundant.
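To illustrate why that's hard: active-passive failover between two load balancers needs a health check and something that decides where traffic goes, and that decider is itself a single point of failure, just one layer up. A toy sketch with hypothetical names, not any real AWS component:

```python
# Toy active-passive failover sketch (illustrative only).
# Note that route() itself is now the single point of failure:
# whatever runs the health checks and flips traffic has the
# same redundancy problem the load balancers had.

class LoadBalancer:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def health_check(self):
        return self.healthy


def route(primary, standby):
    """Send traffic to the primary unless its health check fails."""
    if primary.health_check():
        return primary.name
    if standby.health_check():
        return standby.name
    raise RuntimeError("both load balancers are down")


primary = LoadBalancer("lb-primary")
standby = LoadBalancer("lb-standby")
print(route(primary, standby))   # lb-primary

primary.healthy = False          # simulate a hardware failure
print(route(primary, standby))   # lb-standby
```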
Oh, I see what you mean. Yeah, I guess it makes sense that us-east-1 goes down often, since it's the most heavily trafficked and it depends on physical hardware that can fail.
It's more that AWS customers build stuff overwhelmingly in us-east-1.
Why do they get a choice in a specific server? Sure they could specify us-east, but why let them pick us-east-1 specifically?
That sort of thing already exists, as availability zones. Things like EC2 instances live in us-east-1a, us-east-1b, etc., each of which is made up of physically separate data centers. In theory that should provide resilience to even a large-scale outage, but evidently it's not foolproof.
The reason they let you be so precise about where you provision servers is that some applications require the servers to be physically close together, especially high-bandwidth stuff.
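The standard mitigation is to spread instances across those zones so one AZ failure takes out only a fraction of your fleet. A round-robin placement sketch (a hypothetical helper, not the actual EC2 API):

```python
from itertools import cycle

# Hypothetical placement helper: spread n instances round-robin
# across a region's availability zones, so losing one AZ takes
# out at most ceil(n / len(zones)) of them.

def spread_across_azs(n_instances, zones):
    az = cycle(zones)
    return [next(az) for _ in range(n_instances)]


placement = spread_across_azs(5, ["us-east-1a", "us-east-1b", "us-east-1c"])
print(placement)
# ['us-east-1a', 'us-east-1b', 'us-east-1c', 'us-east-1a', 'us-east-1b']
```

As the comment above notes, this only limits the blast radius of a single-AZ failure; it doesn't help when a region-wide dependency breaks.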
In particular, the control plane for Route53 (their DNS product) lives entirely in us-east-1, so the blast radius of a bad outage in that region is enormous.
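The distinction matters because Route53's data plane (actually answering DNS queries) is globally distributed, while changes go through that us-east-1 control plane. For example, a DNS failover setup is expressed as a change batch sent via `ChangeResourceRecordSets`; here's a sketch of building one (zone name, IPs, and health check ID are placeholders), where constructing the dict is harmless but applying it during an outage is exactly what can fail:

```python
# Sketch of a Route53 failover record pair (placeholder values).
# Applying this via boto3's route53.change_resource_record_sets()
# goes through the control plane hosted in us-east-1.

def failover_change_batch(name, primary_ip, secondary_ip, health_check_id):
    def record(set_id, role, ip, hc=None):
        r = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,
            "Failover": role,            # PRIMARY answers while healthy
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        }
        if hc:
            r["HealthCheckId"] = hc      # only the primary needs a health check
        return r

    return {
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": record("primary", "PRIMARY",
                                         primary_ip, health_check_id)},
            {"Action": "UPSERT",
             "ResourceRecordSet": record("secondary", "SECONDARY",
                                         secondary_ip)},
        ]
    }


batch = failover_change_batch("app.example.com.", "192.0.2.1",
                              "192.0.2.2", "placeholder-health-check-id")
print(batch["Changes"][0]["ResourceRecordSet"]["Failover"])  # PRIMARY
```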