Why not us-east-2 if you need east coast? Even so, I’m almost as far as I can be (in America) from us-west-1 and us-west-2. The difference in speed is barely perceptible when running services out west. In fact, I think my Lightsail PiHole is out there.
As I said, I’m not aware of any historical issues at the other DCs. 🤷🏻‍♂️
I feel like this is a bit of a loaded question that needs peeling back, as you can see from other replies.
For some context, I worked at AWS for a few years and worked at other companies of various sizes. I’m not breaking NDAs or anything here but I say this to say I’ve had a fair amount of exposure to this problem.
Also one thing I want to clear up. Other regions do go down for various services. You just don’t hear about them because they don’t have the catastrophic blast radius that us-east-1 does.
So let’s start with the external company part: “Why don’t others put resources in other regions?” Let’s say you’re starting a new company. You’ve prototyped and built a web app. Most likely that app consists of the following components:
- a server to respond to requests
- a database
- a worker for asynchronous tasks (think Sidekiq/Active Job for the Rails folks)
Often at the start all of these are just on the same box. This works up until you have a large web presence and suddenly can’t throw enough hardware at the problem to make it go away (for those who want the more technical term, this is called “vertical scaling”).
So cool you want to take this app and make it regional now. There’s a lot of gotchas that come around from this that can bite you if you weren’t already accounting for this. I’ll list a few but there are numerous ones:
- If you have your server write to a temp file and then read it in something else (like a worker) this won’t work when they’re not on the same box anymore. You need to put this in something like S3 or some intermediary that both have access to.
- You have to be careful with how you partition requests to specific regions, and make sure there isn’t anything local that’s gonna break if a user accesses your app in one region, then takes a vacation to another.
- The big thing: If you’re used to having one centralized database, there are bad assumptions you can make that are hard to break out of in code. A big one is this example:
POST /comment  # write a new comment
GET /story     # load all comments
If you work in a single-DB setup, that GET will always get you the new comment a user made. But when you start using some of the techniques to horizontally scale (read replicas, a move to DynamoDB, etc.), THAT’S NO LONGER A GUARANTEE. You may be reading from a completely separate box that hasn’t gotten your change yet. There are guards + patterns for handling this, but it’s not cheap to retrofit them onto old parts of the code base.
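To make that replica-lag problem concrete, here’s a minimal sketch (plain Python, with in-memory dicts standing in for a real primary and replica; all names are made up) of one common guard, the read-your-writes pattern: after a session writes, route its reads to the primary for a short window instead of a possibly-stale replica.

```python
import time

class ReplicatedStore:
    """Toy primary/replica pair. The replica only sees writes after sync()."""
    def __init__(self):
        self.primary = {}
        self.replica = {}

    def write(self, key, value):
        self.primary[key] = value  # replica lags behind until sync() runs

    def sync(self):
        self.replica.update(self.primary)

class Session:
    """Read-your-writes: pin a session to the primary right after it writes."""
    PIN_SECONDS = 5.0

    def __init__(self, store):
        self.store = store
        self.last_write = 0.0

    def post_comment(self, story, text):
        comments = self.store.primary.get(story, [])
        self.store.write(story, comments + [text])
        self.last_write = time.monotonic()

    def get_story(self, story):
        # Recent writer: read the primary so we see our own comment.
        if time.monotonic() - self.last_write < self.PIN_SECONDS:
            return self.store.primary.get(story, [])
        # Otherwise a (possibly stale) replica read is fine.
        return self.store.replica.get(story, [])

store = ReplicatedStore()
session = Session(store)
session.post_comment("story-1", "first!")
print(session.get_story("story-1"))  # ['first!'] even though the replica hasn't synced
```

The cost the comment is describing is exactly this: every read path in an old codebase silently assumed the “always read the primary” behavior, and each one has to be found and wrapped in something like the above.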
If you’re a startup that’s taking the plunge, this is a large cost between re-architecting and code changes. It’s one you should absolutely incur. But it’s not cheap.
If you’re a massive company that’s not doing this, then you’re playing with fire and deserve the pain you’ve wrought. Though I’d say most of the time when large companies do get burned, it’s because there’s one small, globally hosted service that no one thought was important but that was actually a critical part of the toolchain.
So let’s say your company has taken that cost and has done everything by the book. You can still get boned even when you don’t think you are.
At AWS there are a handful of “core” services. These services are the critical building blocks of everything at AWS. Think EC2, Lambda, S3, and DynamoDB. A lot of internal and external training pushes SDEs to build almost everything with these key components (at least in some parts of AWS; there are others that use different toolchains. It’s a massive company, so I can’t pigeonhole everyone here). If you read a lot of their marketing slop you can see they encourage customers to use these as well.
Even if you don’t use any of the above, there’s a good chance that a service depends on one of them anyway. I’ll give an example everyone can check. Let’s say you’re starting out and building a brand new service. You wanna keep it simple and just make an EC2 box to keep your dependencies small. You write some code in CDK (Amazon’s newer IaC tool) to make this box and go to do your first deploy. One of the first steps in this process is taking your artifacts and writing them to an S3 bucket in your region. If you wanna make deploys, you now have an S3 dependency.
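For illustration, a “just one EC2 box” stack in CDK’s Python flavor might look like this (a hedged sketch; the stack and construct names are made up). Even this minimal definition picks up S3 at deploy time, because `cdk deploy` stages the synthesized template and assets in the region’s bootstrap S3 bucket.

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

class TinyServiceStack(Stack):
    """Hypothetical minimal service: one small EC2 instance, nothing else."""
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Even "one box" needs a VPC to live in.
        vpc = ec2.Vpc(self, "Vpc", max_azs=1)
        ec2.Instance(
            self, "Box",
            vpc=vpc,
            instance_type=ec2.InstanceType("t3.micro"),
            machine_image=ec2.MachineImage.latest_amazon_linux2(),
        )

app = App()
TinyServiceStack(app, "TinyServiceStack")
app.synth()
```

Nothing here names S3, yet deploying it uploads the template and any assets to the CDK bootstrap bucket in your account/region, so your deploy path now has an S3 dependency anyway.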
So if one of these massively goes down in a region it’ll most likely take other things with it.
Now let’s say you’re one of the companies that’s doing all the right things and has a perfect region-failover plan. Well, you can still get hosed, as there are certain services (like IAM and I think Route 53?) that are still globally hosted in us-east-1. Now if us-east-1 goes down, your IAM goes down. And now you have issues even when you did everything by the book. I think they are trying to get rid of that issue but I have no idea (and I wouldn’t say even if I did lol). Even if it’s not us-east-1, I can guarantee there are probably some other small things in other hosted regions that would have catastrophic effects like this.
TL;DR - shit’s hard. You can do everything right and still get fucked by this.
I believe us-east-1 is the default region so it’s probably a case of devs not changing their region unless they need to.
Also, thousands of companies use AWS. An issue in any of their regions is likely to have significant impact on internet services.
East 1 is the default, so you have something there. Not sure why a fault in any given region would affect others? My PiHole didn’t go down and I would surely notice a lack of DNS. :)
Some people are asking why other regions seem to be affected when us-east-1 goes down. Why aren’t they separated out? I used to work in AWS, but will speak generally.
First, it’s important to understand the concept of a control plane vs a data plane. Amazon and other big scale companies often talk in terms of control plane/data plane separation because those two concepts have wildly different scale and requirements.
A control plane is the side of a service that handles its administrative functions. For example, S3 separates out bucket creation and deletion work from object create/edit. In Route 53, this would be creating and editing zones. In IAM, it’s the creation of AWS access keys for IAM users. IAM roles, IIRC, work differently and can function more in the data plane.
A data plane is the side of the service that handles the main meat and potatoes of a service. For example, in S3, any object creates, edits, and deletes would all be part of the data plane. In Route 53, it would be any DNS record query. I don’t know if updating a record was considered a data-plane call or not.
These are separated out because data-plane call volume generally massively dwarfs that of the administrative APIs. It’s also done because control-plane calls often have extra complexity. In Route 53, creating a zone means you need to go find n different name servers that can handle a given domain name without overlapping with another customer, tell them they should now handle calls, and get the records to those servers running all over the world.
The fact is, Route 53 is globally replicated, so it needs a source of truth, and engineering culture pushes Amazon toward a pull-based approach. If a user creates a zone in eu-west-1, they still expect it to be on servers all over the world, so how do you get it there? Well, AWS takes the approach that certain services can have a single-region dependency for their control plane, in cases where avoiding one is infeasible technically or for the business; the data plane of the service, however, can’t have that dependency.
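A toy sketch of that control-plane/data-plane split with pull-based replication (plain Python; all class and method names are invented for illustration): the control plane owns the source of truth, edge servers periodically pull a copy, and queries keep being answered from the local copy even while the control plane is unreachable.

```python
import copy

class ControlPlane:
    """Source of truth for zone data. Low call volume: create/edit zones."""
    def __init__(self):
        self.zones = {}
        self.available = True

    def create_zone(self, name, records):
        if not self.available:
            raise RuntimeError("control plane is down")
        self.zones[name] = dict(records)

    def snapshot(self):
        if not self.available:
            raise RuntimeError("control plane is down")
        return copy.deepcopy(self.zones)

class EdgeServer:
    """Data plane: answers queries from a local copy it pulls periodically."""
    def __init__(self, control_plane):
        self.control_plane = control_plane
        self.local_zones = {}

    def pull(self):
        try:
            self.local_zones = self.control_plane.snapshot()
        except RuntimeError:
            pass  # pull failed: keep serving the last good copy

    def resolve(self, zone, record):
        return self.local_zones.get(zone, {}).get(record)

cp = ControlPlane()
edge = EdgeServer(cp)
cp.create_zone("example.com", {"www": "192.0.2.10"})
edge.pull()

cp.available = False          # control-plane outage begins
edge.pull()                   # this pull fails silently...
print(edge.resolve("example.com", "www"))  # ...but queries still answer: 192.0.2.10
```

This is why a control-plane outage typically means “you can’t create or change things” rather than “existing things stop resolving”: the data plane is built to survive losing its upstream for a while.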
Don’t quote me on this, but I think I read earlier this morning that all of AWS has to go through us-east-1 to verify site certificates.
I’m not sure though, I was rather sleepy at the time I think I read something along those lines. 🤷
IAM needs to go through us-east-1 and everything needs IAM. And I assume IAM went down due to DynamoDB going down in us-east-1?
This is literally the tech stack meme with one tiny block holding it all up
Well fuck me that explains it all. Not a simple fix at this late date.
This is a little misleading. It does not mean that every single region depends on us-east-1 to authenticate every API call. That would be insane and obviously mean that every region has a dependency on us-east-1.
Instead, us-east-1 is what’s called a partition leader. It holds the secret key material for everything in the commercial partition and regularly distributes it to other regions. So if it’s down for an extended period of time, other regions’ IAM can be impacted, but then there’s some other complexity with STS endpoints. You can actually see a byproduct of this if you look at how the SigV4 signing algorithm works: each HMAC layer is expanding the key scope.
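For the curious, the SigV4 signing-key derivation being referred to is public and easy to sketch (this follows AWS’s documented algorithm; the secret key and request details below are made-up placeholders). Each HMAC layer narrows the key from account secret to date to region to service, which is the key-scope expansion mentioned above:

```python
import hashlib
import hmac

def hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key; each HMAC layer scopes the key further."""
    k_date = hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = hmac_sha256(k_date, region)        # key is now region-scoped
    k_service = hmac_sha256(k_region, service)    # ...and service-scoped
    return hmac_sha256(k_service, "aws4_request") # final signing key

# Placeholder secret; a real one comes from IAM.
key_east = sigv4_signing_key("EXAMPLEKEY", "20251020", "us-east-1", "dynamodb")
key_west = sigv4_signing_key("EXAMPLEKEY", "20251020", "us-west-2", "dynamodb")
print(key_east != key_west)  # True: keys derived for different regions differ
```

Because the region is baked into the derived key, a regional endpoint can verify signatures locally from distributed key material rather than phoning home on every call, which is the point the comment is making.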
Anyway, this part of IAM is pretty battle-tested and, from what I saw, not the cause of today’s outage.
Just my own theory, but my observations are that us-east-1 is often a little cheaper than other regions, plus they have access to the latest resource types.
Thought the prices were identical? I was in DevOps at my last company and while I hardly touched AWS, there was no discussion of variable pricing.
Compare Virginia with California. This chart is specifically for EC2, but I believe the trend extends to other resources. The differences are larger when you start looking outside the US. And, if you weren’t aware, AWS also offers reserved and spot EC2 instances for savings relative to on-demand instances.
N. California as a region can’t grow, and it’s priced accordingly. Instead, compare US East (Ohio) or US West (Oregon) for a region that’s price-competitive. A lot of Amazon internal stuff was starting to move to US East (Ohio) because it was geographically close but a lot less problematic.
I had a long comment but @criss_cross added more context than I could. To summarize, there is infrastructure in us-east-1 that can take you down even if you host in another region. Also lots of stuff there and closer=faster=better.