I think the next major outage will be scrutinized heavily, for sure. And I have a hunch it'll come in the next couple of months: globally watched events, where traffic spikes are synchronized down to the second, tend to put huge strain on very complex systems like most modern social media sites.
Teams of people typically spend weeks doing capacity planning in anticipation of events like the World Cup or New Year's. The World Cup specifically is a notorious SRE nightmare for big social media sites: you'll have people from all over the world posting at the exact same time (like, to the second/minute) when exciting things like goals happen, and they'll be posting photos and video clips and excitedly spamming a million tweets.
Increased load is by far the #1 cause of major site outages. You've got a complex system with a million tiny gears, and if one gets overwhelmed and starts to slow down, or a disk fills up, or too many things are connected to a database, or whatever, then the whole thing catches on fire in spectacular ways.
Having everyone in the world tweeting about an incredible save, or a shitty call, or a ridiculous goal all at the same time means the traffic is super concentrated and super high. A lot of work has to go into preparing to keep things online and actively putting out fires while the events are ongoing. I've heard engineers from Instagram talk about how they always have a miserable New Year's because a lot of things break under the increased load. And it's just not possible or practical to anticipate every potential failure mode.
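To make the "concentrated traffic" point concrete, here's a toy sketch in Python. Everything in it is made up for illustration (the `peak_backlog` helper, the request counts, the 1,500 requests/second capacity); it's not how any real site models load, just a single queue drained at a fixed rate. Both scenarios send the same 60,000 requests in a minute; the only difference is whether they arrive evenly or pile into one second.

```python
def peak_backlog(arrivals_per_second, capacity_per_second):
    """Return the largest queue depth seen while draining at a fixed rate."""
    backlog = 0
    peak = 0
    for arrivals in arrivals_per_second:
        # Work left over after this second: old backlog plus new arrivals,
        # minus whatever the service managed to process (never below zero).
        backlog = max(0, backlog + arrivals - capacity_per_second)
        peak = max(peak, backlog)
    return peak


# Hypothetical numbers, chosen only so both minutes carry the same total load.
capacity = 1500                              # requests the service handles per second
steady = [1000] * 60                         # 60,000 requests spread evenly
spike = [500] * 30 + [30500] + [500] * 29    # 60,000 requests, most in one second

print(peak_backlog(steady, capacity))  # 0
print(peak_backlog(spike, capacity))   # 29000
```

Same total traffic, but the spiky minute leaves 29,000 requests waiting at its worst moment, and in this toy setup it takes the rest of the minute to drain. In a real system that backlog is where the timeouts, retries, and full disks start.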
Long ramble, but I'd expect it to be a slow collapse until it isn't. It can't stay online with a skeleton crew forever.