AWS outage on Tuesday caused by employee error


(jkp) #1

(Dan P Parker) #2

That’s what you call your basic, “Better update your resume” event.


(ActionTiles.com co-founder Terry @ActionTiles; GitHub: @cosmicpuppy) #3

I’ve seen it go both ways…

i.e.,

  • sometimes a reprimand followed by a layoff / termination without cause.
  • and sometimes … no repercussions at all.

(Dan P Parker) #4

Yeah…but the proactive step is well-advised anyway.


(Micheal ) #5

Having worked (and currently working) in the health and technology sectors, where lives and money are involved, I, ironically, have rarely seen these types of things become ‘resume generating events’. Amazon, like any DevOps organization, has strict change management processes with peer review. The human factor is ALWAYS the weak spot, and if the leadership is worth their weight, they understand that ‘mistakes happen’. If all of the processes were followed and it was human error, they will find a way (through process and technology) to eliminate that completely.

Now, if change management wasn’t followed (or was ignored) and this particular person has made this type of mistake in the past, I would almost bet they are in the Pioneer Square Starbucks in Seattle updating their resume and networking with their LinkedIn contacts. There is no lack of jobs in Seattle (I just came from there), so even with this type of mistake, anyone coming from Amazon (especially S3) can get a job in about 20 minutes, if they didn’t already have something set up (most people there do).

I wouldn’t weep for this person for very long…I would be more concerned with the person’s manager…if this were a fatigue or training issue, it is usually the manager that gets the can, not the employee…it sends a bigger message.


#6

Friday night at the bar … “How was your week? Awesome, ate sushi, blew up the Interwebs. It’s all good.”

But seriously, it may be human error, but it sounds like they needed better procedures and failsafes anyway.

I’m so glad I’ve never made a mistake :slight_smile:.


(ActionTiles.com co-founder Terry @ActionTiles; GitHub: @cosmicpuppy) #7

Of course, Amazon could just be saying that it was “an employee error”…

They certainly wouldn’t admit that they have a major flaw in their fundamental architecture!

For PR purposes, they have to choose how to spin the outage. Doesn’t mean the spin is the truth.


(Realy Living Dream) #8

They don’t want to admit Putin had it hacked to release top secret conspiracies to WikiLeaks


(Tony - SmartThings Unpublished Contributor ) #9


(Dan P Parker) #10

The only process that can completely eliminate human error is complete elimination of humans from the process.


(Bobby) #11

Alexa, turn on the billing server…See, it’s that easy!


(Micheal ) #12

Per the full explanation directly from Amazon, that is exactly what they are going to attempt to do…while humans will still have to push a button, it sounds like they will check for the WRONG button being pressed :slight_smile:


(Dan P Parker) #13

Completely eliminate humans from the process? It sounds like some humans need to update their resumes.

Oh, wait…


(Benji) #14

And they didn’t even update the status page initially… Where are all the ST haters at?! :smile:


(Marc) #15

The best part of the post mortem was that their status page wasn’t accurate because it depended on S3 and wasn’t taking advantage of cross-region S3 replication. Ironic that even AWS didn’t have full resiliency for their own infrastructure. Guess we can’t get too mad when ST status pages aren’t accurate :slight_smile: