GOTO notes: Outage Response

04/29/13

Escalation and Response to Outage Scenarios
John Allspaw (with Etsy)

Emergency scenarios under time pressure.

Managing the Unexpected (book) - Karl Weick and Kathleen Sutcliffe

Skills, rules, & knowledge
1. Skill-based - Simple, routine; muscle memory
2. Rule-based - knowable, but unfamiliar
3. Knowledge-based - novel scenarios; trial & error

OODA Loop

Response to escalating scenarios
We assess how things are in the moment, but neglect rates (developments over time)
We have difficulty with exponential developments (accelerated change)
We tend to think in causal series instead of causal nets. The world doesn’t work like a lined-up series of dominoes.
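
A minimal sketch of the first two points, using made-up numbers (not from the talk): a hypothetical resource starts at 1% of capacity and doubles each step. A snapshot check with an assumed 80% threshold fires only after capacity is already exceeded, while a rate-based estimate flags the trend early. All function names and values here are illustrative assumptions.

```python
def doubling_series(start, steps):
    """Return usage values that double each step: start, 2*start, 4*start, ..."""
    return [start * 2 ** i for i in range(steps)]


def first_snapshot_alarm(values, threshold):
    """Index of the first value at or above threshold, or None if never reached."""
    for i, v in enumerate(values):
        if v >= threshold:
            return i
    return None


def steps_until_full(current, previous, capacity):
    """Estimate steps until capacity, assuming the last growth ratio persists."""
    ratio = current / previous
    if ratio <= 1:
        return None  # not growing, so no estimate
    steps = 0
    value = current
    while value < capacity:
        value *= ratio
        steps += 1
    return steps


# Hypothetical usage in percent of capacity: 1, 2, 4, 8, 16, 32, 64, 128
usage = doubling_series(start=1.0, steps=8)

# Snapshot view: "are we above 80% right now?"
alarm_at = first_snapshot_alarm(usage, threshold=80.0)

# Rate view at step 2 (usage is only 4%): "at this growth rate, when are we full?"
eta_at_step_2 = steps_until_full(usage[2], usage[1], capacity=100.0)

print(f"snapshot alarm first fires at step {alarm_at} (usage {usage[alarm_at]}%)")
print(f"rate-based estimate at step 2: full in {eta_at_step_2} more steps")
```

With these numbers the snapshot check stays quiet through step 6 (64%) and fires at step 7, when usage is already past 100%. The rate-based estimate, made at step 2 with usage at 4%, already predicts that same step.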

Pitfalls
Thematic Vagabonding - you’re trying to work out what’s going on, signals are coming in, but you’re flailing - context-switching
Goal Fixation - Intuition tells you this is something you’ve seen before; you’ll prove that to everyone
Heroism - non-communicating lone wolf-isms - “Hold on, I’ve got this” - not a team player - win or lose, the team loses

Emergencies are increasingly handled by teams, not individuals.

Successful teams
- know when to scatter, and when to swarm - but communicate
- stabilize the patient; we’ll figure it out later (incident resolution)
- reproducibility?
- fault tolerance - Erlang - does add complexity

Shotgun debugging - almost never works

Joint Activity
- Interpredictability - ability to predict the abilities of others involved in the activity (and sharing your own abilities/actions) - make it clear who’s doing what
- Common Ground
- Directability - deliberate attempts to modify the actions of others - also being open to direction

Improvisation
Typically have to improvise within constraints.
Needs to be practiced.

Communication
- Explicitness - be concrete, not abstract - who/what/when, Alpha/Bravo/Charlie
- Timing
- Assertiveness - passive-assertive-aggressive continuum - latter disregards another person’s perspective/signals - we want to be assertive (in the middle)
Yes-And (from Improv) - augment the other person’s comments

Airline pilots and co-pilots are co-located; they can see body language, etc.

Two-way communication is great, but takes longer (which is why developers hate meetings).

Feedback

Decision Making
Sources of Power (book) - Gary Klein - decision-making in the wild
1. What is the problem? (Sense-making) (focus on key aspects)
“Information is not scarce; attention is.” (missed the quote’s author)
2. What should I do?
Intuition
Rule-based (not good for experts)
Choice decisions (often takes more time)
Creative decisions (time-consuming, untested)

Pre-mortems (Klein)
Before launch:
1. What could go wrong? (scenarios, contingencies)
2. It has all gone wrong. Now tell me why. (thought experiment)

Post-mortems are also extremely important.
Firing someone is almost certainly the worst thing you can do.

Human Error
- Attribution AFTER the fact
- A symptom, not a cause (you’ve stopped looking)
- “Root” cause? (myth - it will happen again)
- Useless labeling
- Using the term is the largest indicator that learning is not your goal

Mature Role of Automation
In the wake of an accident, we will reach for more automation.
- Moves humans from manual operator to supervisor (changes work; doesn’t remove it)
- Extends & augments human abilities; doesn’t replace them
- Doesn’t remove “human error”
- Is brittle (can’t adapt)
- Recognizes that there is always discretionary space for humans
- Recognizes the law of stretched systems

Law of Stretched Systems
Anti-lock brakes - people drive faster!

Procedures & checklists are written so that they make sense to the author at the time they were written.
When Canada geese fill your engine, do you look for the procedure?


Copyright © 2000-2013 by William Sorensen. All rights reserved.