>> Thank you, Jeremy.
And the opening slide is, I guess, something that most of us as developers, and probably even as product managers, must have done.
So show of hands for anyone who has run either this or a DROP TABLE on a production or non-production system.
Show of hands.
Non-production is totally fine.
Okay.
I don't see that many.
Interesting.
You all are great software engineers.
Never made this mistake.
Unfortunately some of us, like myself, tend to make this mistake.
Which means that on a production system, we tend to believe it's probably a non-prod box and make this one-off mistake where we just go in and run a command like this.
What happens?
The consequence can be anything between an hours-long outage that involves maybe millions of dollars of customer impact and something as simple as restarting the box or failing over to another AWS region...
Whatever you call it.
But today's talk is about what happens right after you run this command.
What happens when your complex system fails.
So my name is Aish, and like Jeremy mentioned, I work for a company called PagerDuty.
And without talking a lot about myself today, I'm just going to dive straight into the topic
and talk about, first: What are complex systems?
The title of the talk already says it's about what to do when complex systems fail.
To get a definition of what a complex system is, let's just take a line from the English Wikipedia.
A complex system is a system that's composed of many components, which may interact with
each other.
Sounds very, very specific, right?
Definitely not.
This kind of covers almost any software.
Software has this modular principle, in which you have classes or objects or functions, or other components that talk to each other.
This sort of inherently means that almost any piece of code that you ship, apart from
that one hello, world example -- or maybe even that too -- is a complex system.
You can ask me what's not a complex system.
Well, if you build a bottle opener, that's not a complex system.
So unless the software system that you build...
Nay, the system that you build... is a bottle opener, it's more than likely a complex system.
And again, you might ask me: What's the deal with all the complex systems?
But let's first address the elephant in the room.
The elephant in the room is: Why are you talking about failure?
And...
The second thing is...
Why are all of these emojis around?
Well, I am a Millennial.
So hence all the emojis in the talk.
So hold my avocado for a second while we take this detour into the world of academia.
So this paper was written, as you can see on the screen -- it's called How Complex Systems Fail, by Richard Cook, who was a medical doctor.
An MD.
And it was written in the year 1998, about patient care and health care systems.
The paper talks about a bunch of scenarios on how things fail.
But again, the object of this talk was not to describe how things fail, but rather to
talk about what to do when things fail.
But there's a very great quote from the man himself, Richard Cook, about failures in general,
and failures of complex systems.
To quote him: Failure-free operations require experience to deal with failures.
Now, let's take a pause here and think for a moment.
Failure-free operations require experience with failure.
This is very, very counterintuitive.
This means that in order to deal with failure, you need to have prior experience.
Now, wait.
Isn't this more like a chicken and egg problem?
Don't you mean you have to have experience to go and fix things, which means you have
to go and break things first?
So this is what this talk is about.
Here's the structure of today's talk.
First, we'll talk about a horror story.
The horror story is not like one of those high budget Hollywood movies.
It's just an operational nightmare that any one of us could be in.
And this is from my own experience, what happened to me in the past, when we did not have a good operational and incident management framework in place.
The second part of this talk deals with lessons learned from the story.
So in the story that I'll be telling you, we'll be seeing a bunch of failure modes in
dealing with failures.
Systematic failure of multiple things, including communication, including tactical things,
as well as talking to the customers.
So the second part is about how to actually deal with these failures, and with failures in dealing with failures.
That's kind of very meta.
And the last part is a review about the things that we will be talking about.
So first, a story about failure.
So chapter one.
This is fine.
To kind of give you a background, this happened to me while I was an intern.
I was still in college.
I was at a small startup somewhere.
And it was the middle of the night.
I get a phone call.
And it was the CTO of the company, calling me to say...
Hey, it looks like this particular piece of software that you shipped is not working,
and it looks like this big customer is not getting their reports.
I didn't know what the next thing to do was.
I was told that there was some bridge number that I needed to dial in.
I was told that there was some HipChat room I was supposed to go in back then and talk
to other engineers.
Being an engineer who was still in college, that was the expectation of me.
I did not for one moment know what being on call meant.
I was roped into this incident call to deal with the systematic failure of complex systems
without actually being equipped with the knowledge of how to deal with these things.
Almost every engineer I knew was on call.
This was a 20-person startup.
So 16, 17 people were on the call.
Most of them were half asleep.
It was a Friday night.
Someone had even dialed in from a bar.
I could hear the background noise of people talking and laughing in the middle of all
of this chaos.
Most importantly, we were all trying to do the exact same thing.
We were all trying to go to the last commit on GitHub and see what happened.
Had we been, like, smart enough, we probably wouldn't all have been doing the same thing.
But since we were all doing the same thing, we didn't get to the solution.
As you probably guessed correctly...
The problem was with a database machine somewhere, and since we had a bunch of machines for different
services, we just didn't know what was going on.
We had no clue where to start.
So 16 engineers in the middle of a Friday night, someone dialing from a bar, an intern,
and the CTO of the company trying to just go and fix reports for this big customer...
Not even knowing what went wrong and how this particular thing failed.
So the thing was that our logging service, the log aggregator, was failing.
So the box, the database machine, ran out of disk space.
As a result, our reports were not being delivered.
That was the actual problem.
Just spoiler alert for people.
Now, that was chapter 1.
This is chapter 2.
A dark and stormy night.
These lines -- "It was a dark and stormy night. The rain fell in torrents, except at occasional intervals, when it was checked by a violent gust of wind" -- are a famous antipattern in English literature.
If you look at them closely, whenever you write an English language-based essay or anything
that's not a poem, this is how you should not be starting.
Now, why am I drawing a comparison between English literature and an operations call?
That's probably...
That's definitely because any engineer that you asked on that call, the call that I was in...
Hey, do you know what we should do next?
...gave answers almost as beside the point as these lines.
The answers used to be...
Well, like, there's a wiki somewhere.
There could be something out there.
But definitely no one had an answer about what the problem was or what they were doing.
So it was almost as ambiguous as these lines here.
Most importantly, there was no clear leader amongst us.
It was like a herd of sheep where everyone was trying to follow each other.
There was no one to coordinate.
In the middle of this, we kind of make a segue to chapter 3.
This is the exec-swoop.
Like the title actually says it all...
The CEO of the company -- this was a tiny startup at the time -- jumped into the call
and started asking questions.
And these questions included things that, as an intern, I was definitely not aware of.
But there were also engineers on the call who were aware of it, but they didn't really
have the answers to these questions.
You might ask me: What questions are we talking about?
This was at 2:00 a.m. on a Friday night.
Can you send me a spreadsheet with a list of affected customers?
In the middle of the night, when you're dealing with fires, the last thing that you want,
literally the last thing that you want is an exec standing on your head and asking you
to send a list of the affected customers.
You barely know what the problem is.
You barely know what you're dealing with.
You're barely able to talk to the right sort of engineers who know the system inside out.
And in the middle of all of this, you're trying to get that one spreadsheet with some customers
who have been affected.
Apart from that one customer who initially reported that things were not working.
So...
Like, we were kind of confused about what to do.
So you know what?
Adding more to the chaos, adding more to the complexity, we decided to do both.
Which meant that we first decided to get the list of affected customers, and then go and
deal with the actual problem.
Which meant that the total amount of time we spent in dealing with this incident was much longer.
We spent almost two hours trying to get this list.
It was finally at 4 a.m. that we got to know what the problem was.
That the particular log aggregator was not running and the servers were running out of disk space, which finally led us to go and fix the problem.
The story does not end here.
The morning after didn't bring us any hope.
The morning brought us some more pain, some more agony.
I was getting the blame for the incident, despite being an intern.
And on top of just cleaning up the mess, I had to go and do a lot of things.
I had to add a cron job, add another metric for monitoring, just because I was blamed.
The question you might ask, given what we can all see, is...
What's wrong with the picture?
It's a cat image that's upside down.
I'm definitely not asking about the cat image.
I'm asking: What's wrong with the story?
If you follow Agile, DevOps, or any of these hipster terms, there's definitely a lot that's
wrong in the picture.
Despite this being a new-age startup, not one of those mammoth old companies that we tend to stereotype, this was still the case.
So a lot of other things went wrong too, but we can categorize what went wrong into two distinct buckets.
The first one: We did not know the difference between a minor incident and a major incident -- a minor incident being a recurring thing.
Something that can be automated.
Something that does not require you to wake up in the middle of the night at 2:00 a.m. and go and log into a computer, among all the other things you have to do.
And the second category of thing that sort of went haywire was not having a framework
or a dedicated method to deal with a major incident.
Had it actually been a major incident.
We'll talk about what a major incident is in a while.
But for now, bear with me that there's two different things.
There's a minor incident and there's a major incident.
Before we try to address these problems that we saw, let's move on to the second part of
this talk.
Lessons learned.
Lessons learned from this particular horror story.
Lessons learned from these mistakes that we made in the call.
And how can a good incident management framework address the concerns and the problems that
we saw on the call?
So before we start to talk about the framework itself, let's see how we can deal with the first part of the problem.
And the first part of the problem was: Not being able to identify whether it's a minor
incident or whether it's a major incident.
So the first thing that companies, organizations, and teams need to do is to define, prepare,
and measure what's a major incident.
So it's critical to define business failures in terms of business metrics.
So, for example, if you are an online retailer, it might be the number of checkouts per second.
If you're an online video or audio streaming platform, it could be the number of streams
per second.
At my current employer, PagerDuty, it's the number of outgoing notifications per second.
So defining your most critical business metric and tying it back to the engineering system
sort of helps you build that understanding throughout the company about whether we are in a major incident.
Is there massive customer impact or not?
The second thing is: Get everyone in the company to agree on the metric.
This means right in from the CTO, the CEO, all of the execs, to someone like an intern
must agree that this is the metric that we're looking for, and once we cross this threshold,
we are in a major incident.
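To make that concrete, here's a minimal sketch of what such a metric-plus-threshold check could look like in code; the metric name, the threshold value, and the stubbed query are assumptions for illustration, not PagerDuty's actual setup.

```python
# Minimal sketch of tying a business metric to a major-incident threshold.
# The metric source, the threshold value, and the stubbed query are all
# hypothetical placeholders, not any company's real setup.

NOTIFICATIONS_PER_SECOND_FLOOR = 50.0  # example threshold everyone agreed on


def fetch_notifications_per_second() -> float:
    """Placeholder: query your monitoring system for the current rate."""
    return 12.0  # stubbed value so the sketch runs end to end


def in_major_incident() -> bool:
    """Crossing the agreed threshold means, by definition, a major incident."""
    current = fetch_notifications_per_second()
    if current < NOTIFICATIONS_PER_SECOND_FLOOR:
        print(f"MAJOR INCIDENT: notifications/sec dropped to {current}")
        return True
    return False


if __name__ == "__main__":
    in_major_incident()
```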
So like in the story, we did not really have a metric to talk about.
We were affecting one customer.
That mattered, but it did not require the entire company to be awake.
So defining these metrics requires you to look at the amount of time and energy you are going to spend in dealing with these types of problems.
The second part is preparation.
So the best organizations prepare for failure beforehand.
Like Richard Cook said, to quote him once again in this talk, failure-free operations require experience to deal with failures.
Companies have their own versions of simulating failures.
A few companies call it Game Days or Chaos Monkey or one of those buzzwords.
It could be automated, it could be manual.
It could be as simple as restarting all servers randomly on a Friday.
That's what we do at PagerDuty.
We call it Failure Friday.
This is to prepare your people to deal with failure beforehand.
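As a hedged illustration of that kind of exercise, here is a minimal sketch; the host names, the service name, and the ssh-based restart are assumptions, not PagerDuty's actual Failure Friday tooling.

```python
# Minimal sketch of a manual "Failure Friday" style exercise: pick one host
# at random and restart a service on it while the team watches the impact.
# The host names and the ssh/systemctl command are illustrative assumptions,
# not any particular company's real tooling.
import random
import subprocess

CANDIDATE_HOSTS = ["app-01.internal", "app-02.internal", "app-03.internal"]


def restart_random_host(dry_run: bool = True) -> str:
    """Pick a random host and restart the service on it (dry run by default)."""
    target = random.choice(CANDIDATE_HOSTS)
    cmd = ["ssh", target, "sudo", "systemctl", "restart", "my-service"]
    if dry_run:
        print("Would run:", " ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return target


if __name__ == "__main__":
    restart_random_host(dry_run=True)
```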
And the most important part of this triad is measuring things.
So measuring the impact during these failure simulation exercises helps you go back and refine those metrics.
If required, tweak them, and get other stakeholders in the business to agree.
Once you complete this triad, you need to make sure a failure is unique before you wake people up.
If it's not, you should be able to automate the response.
For example, in my case, the failure was just that a job that was supposed to run on the machine and clear up space from old log files did not run properly.
If the fix is something as simple as going and freeing up space on that machine, that kind of failure should not be wasting human time.
Human time is precious.
If you can automate it, just go and automate the thing.
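For a failure like the one in my story, the automation could be as small as a scheduled cleanup job; here's a minimal sketch, with the log directory and retention window as made-up values.

```python
# Sketch of an automatable response to the failure from the story: remove log
# files older than a retention window so the box never fills up its disk.
# The log directory and retention period are made-up values; schedule this
# from cron or a systemd timer instead of waking a human at 2:00 a.m.
import os
import time

LOG_DIR = "/var/log/myapp"  # assumed location of the old log files
RETENTION_DAYS = 14         # assumed retention window


def clean_old_logs(log_dir: str = LOG_DIR, retention_days: int = RETENTION_DAYS) -> int:
    """Delete files older than the retention window; return how many were removed."""
    if not os.path.isdir(log_dir):
        return 0
    cutoff = time.time() - retention_days * 86400
    removed = 0
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed += 1
    return removed


if __name__ == "__main__":
    print(f"Removed {clean_old_logs()} old log files")
```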
So like I said, remember...
You should only be triggering major incident response if you are in a major incident.
So getting those 20 people on call and trying to solve the problem will only make sense
if it is something that could not be automated.
If it was a button click away to just go and clear up the log space, you should probably
have done that.
Well, hindsight aside, let's just move on to the meat of this talk.
So the meat of this talk is talking about this framework that's inspired by the National
Incident Management System of the United States.
This is a framework developed by the Department of Homeland Security, and it's used for dealing with natural calamities and other incidents as classified by the national government.
When I use these bigger words, people give me a look and say...
Aren't we talking about software and IT applications?
How can something that was designed to deal with natural calamities be applied to software, IT, and operations failures?
Well, the core of this is to deal with failure.
When you try to categorize failure, the failure modes are kind of similar.
So the lessons learned from the NIMS framework can for sure be applied to software failure modes as well.
So the first thing, as most software developers know, is the single responsibility principle.
And the single responsibility principle that I'm referring to is not about code reuse or keeping things DRY, which is...
Don't repeat yourself.
Or just cleaning up your code.
This is about: Whenever you get paged, whenever you get a phone call from your CEO, your CTO,
or someone, make sure that there's only one person responsible for one task.
Do not have a redundancy there.
The redundancy might be good when you're actually writing code and deploying it into distributed
systems.
But when you're talking about people, having the same task being done by two people in
the middle of the night is not the way to go.
Particularly not in a major incident, which might run for hours and hours.
So since I just mentioned the single responsibility principle, there are different roles that people can take when they join this incident call.
So the first -- and when I say an incident call, I'm referring to a major incident.
If you're an online retailer and your customers are not able to check out, what do you do?
It's an all-hands-on-deck scenario.
So the first thing that comes to mind is the subject matter expert.
So the subject matter expert is sometimes what we call a resolver, or a responder to
the particular event.
They are the domain experts.
It could be someone from the team who built the service, or knows it well enough to go on call for it and fix the things that are necessary.
In my story, we had 15 engineers redundantly working on the same parts of the same system.
And we were all the SMEs, the experts.
You don't need 15 experts.
One person.
One person per logical component so as to avoid confusion in an incident call is sufficient.
Now, this is the mantra of the subject matter expert.
Never hesitate to escalate.
And this kind of comes back to my situation...
Well, since I was an intern...
Let's say I was on call and I got called by my CTO. What's the first thing that I do?
The first thing I should be doing, as with the framework, is just saying it out aloud
that I don't really have enough context on this.
So please escalate it to the next level.
Please bring in someone else who knows the system better, so that I'm not the one on call for this thing.
So never hesitate to escalate.
As an SME.
The next and the most important role in an incident call is of what we call an incident
commander.
And before we kind of get into the details about what is an incident commander and what
are the roles and responsibilities, let's just take a slight detour.
The image in the background is of Gene Kranz.
He is known for being the flight director for the Apollo 13 rescue mission.
So if you have seen the movie Apollo 13, you might have seen someone coming in with a vest
and trying to get everybody on board to work together as a team, and try to get a group
of astronauts stuck in space back to earth.
So draw a parallel there.
This is what an incident commander does.
If your database has been dropped or your business is not able to somehow function in
the middle of the night or the middle of the day -- it does not really matter -- the incident
commander is the sole point of contact.
The incident commander is the person who drives the entire incident call.
So this means that the incident commander is responsible for single-handedly talking to everyone, from the CEO right down to the responders on the call, and making sure that everybody is working together as a team towards the solution of the problem.
So what's the first thing that an incident commander does?
The first thing that an incident commander does is they notify that the company is in
a major incident.
And here we are talking about an internal notification.
So this actually means jumping onto the incident call and saying it out loud:
I have been notified that there is a major incident going on and I'm the incident commander
for this call.
Is there anyone else on this call?
Which means you're trying to gather subject matter experts.
That helps us segue into the next part of our incident commander roles and responsibilities.
You verify that all subject matter experts are present on the incident call.
This is essentially just asking out loud whether people from the different teams, for whatever is going down, are present.
And then you get onto the long running task of dividing and conquering.
So what do we mean by divide and conquer?
Isn't the incident commander the single point of contact for this?
Yes, but the incident commander is not the subject matter expert.
The incident commander does not need to know the ins and outs of the system.
The incident commander does not need to be a principal or senior architect.
The incident commander is just there to coordinate and help people work together so they can
work as a team.
So the incident commander's responsibility is to delegate all actions and not act as
a resolver.
The other key thing about an incident commander and an incident call is to communicate effectively.
This also means to maintain order.
To try to control the chaos that comes out of a tired set of people trying to work towards
systems-level solutions.
So the incident commander needs to bring human empathy into the call.
Which sometimes may translate to just asking people to drop off the call and go and spend
an hour outside.
They don't need to be on the call if someone else can be on it instead.
So the incident commander is also responsible for swapping in and out people from an incident
call, based on their judgment.
And effective communication also means handling the fact that sometimes people might be harsh towards each other.
Like all human beings, people sometimes get tired on an incident call.
They may shout at each other, or not use the best words.
So the incident commander's responsibility is to make sure that the communication there
is also great.
Next, the incident commander is responsible for avoiding the bystander effect on the call.
What do I mean by this?
Rather than saying something like...
Please say yes if you think it's a good idea to do so...
So if I am the incident commander, rather than asking for permission to do something, I ask something like...
Are there any strong objections to doing that?
You take one of the suggestions from a stakeholder on the call, preferably an SME, a subject matter expert, and you ask whether there are any strong objections to doing it.
This helps avoid the bystander effect.
We have seen it at the place where I work, and at other places as well: this cuts out a lot of the situations where a bystander effect shows up, particularly on incident calls, compared to asking for permission to do things.
The next thing is reducing scope.
I guess we've all been there one or more times: something is going on in a production system, your company's core business has been affected, and just for the sake of information, you leap onto that incident call just to know what's going on.
So one of the key things about being incident commander is to reduce scope.
Which means not allowing people, apart from those who are required, to be actually present
on the call.
So this is just done to not burn out people.
Having more than the necessary number of people on a call just means there's a crowd, a lot of noise, a lot more confusion.
There could be a clash of ideas or opinions.
People are opinionated.
Particularly engineers are.
Which means the incident commander's responsibility is to reduce scope.
Which means removing people from incident calls if you feel, as an incident commander, that they should not be part of the call or that their help is not required.
You politely ask them to leave the call and say we will add you back to the call if we actually need your help.
The next part is maintaining order.
This is something that we kind of touched upon before.
But one of the things we sort of talked about in the communications part as well was reminding people to talk only one at a time.
So not having multiple people talk at the same time.
The next role is of the deputy.
And like in the old Western movies, the deputy is not responsible for a lot of things on an incident call.
What the deputy does is act as an assistant to the incident commander.
This means that the deputy is responsible for getting all subject matter experts up
to speed about what's happening.
So imagine that you are, again, in the middle of a chaotic incident.
Something has gone wrong with the production system.
And the deputy calls you in on your phone.
You are a subject matter expert, joining the bridge, joining the call.
So the deputy is the person responsible for giving information to you.
You might be joining the incident call five hours after it started.
Which means you probably have no context about what has been happening.
So rather than having the incident commander stop all of their other tasks and come back and talk to you, the deputy acts as the backup incident commander: calls you on your phone, pages you, reaches out to you, gets you on the call, and fills you in on what has been happening.
In the middle of all of this, the incident commander can carry on with their responsibilities,
so that their standard workflow is not affected.
The other responsibility of the deputy IC, the deputy incident commander, is to liaise with the stakeholders.
Remember, I kind of mentioned that the incident commander is responsible for making sure that
everybody in the company knows we're in a major incident?
Now, in the middle of all of this, the incident commander might get an email, a message on their Slack, or a phone call from someone like the CEO.
So rather than having the incident commander interrupted by these external interruptions,
the deputy incident commander is responsible for liaising with the stakeholders and the
incident commander.
So the deputy incident commander acts as this particular liaison between this particular
call and the stakeholders.
So this could be someone who is an exec in the company, or someone else who is not part of the actual incident response call but wants to know what's going on.
Talking about not being part of the actual incident call, but wanting to know what's
going on, there's a dedicated role in this incident management framework, and it's called
a scribe.
What does a scribe do?
A scribe documents the timeline of an incident call as it progresses.
So this is just someone typing on your chat medium.
It doesn't have to be a chat medium.
It could be a Google Doc, or it could be your Slack, your HipChat, or Skype.
Any sort of messaging or shared documentation that's accessible to people within the company.
This is internal.
They note down the time when the call starts.
This is the time that the incident call was started.
And then they start taking notes about what people said and how things are progressing.
This kind of acts as a bridge between people on the call and people off the call.
So the scribe essentially tries to get feedback from people who are outside the call.
So if I happen to know something that the SME in this particular incident does not know, I can very well message the scribe on Slack or Skype and tell them that whatever you people are doing on the call may not really be accurate.
There's an alternative.
And the scribe could, again, get that feedback relayed into the incident call, without having
you or someone else outside the call jump back into the call.
The next role is that of customer liaison.
In my particular story that we talked about, the CEO sort of jumped in and started asking questions about customer-facing things, which the engineers did not really know about.
So the role of the customer liaison is kind of to avoid that entire thing, where an exec
comes in and starts asking questions about customer-facing things.
The role of the customer liaison is to act as the bridge between the customers and the
incident call.
This means keeping track of things like...
People tweet at you all the time with "You know, your site is down".
Keeping track of those things, and keeping track of the support queries that you may get during your downtime.
The customer liaison acts as this bridge between any customer-facing request and the internal
incident call, as such.
They also directly talk to the IC, the incident commander.
And rather than just talking to the subject matter experts and confusing them with these things, they let the incident commander make the call on things like...
Whether to get that spreadsheet, or whether...
Should we go and focus on the problem first?
So the incident commander actually makes that call.
The request comes in from the customer liaison.
The customer liaison is also the person responsible for notifying people outside of the company about this incident.
So this involves sending out tweets about "Looks like we're having some problem with...
Some service".
Something like an API.
And the entire site is down.
Putting these updates on your status pages.
So this role is specifically targeted at the customer liaison.
Because they are in constant touch with the outside world, they have a better picture about what to put out there.
How do you translate this internal impact to the outside world?
The customer liaison works with the incident commander to frame that particular message
that is to be posted on the outside world.
And yeah, the customer liaison also keeps the incident commander apprised of any relevant
information.
So if there is a large number of customers complaining about a particular thing, the incident commander can use their judgment wisely about the situation while the call is actually progressing.
The next thing.
The incident commander role sounds a bit heavy.
So something that the operations response guideline proposes is to allow a graceful transfer of command from one incident commander to another, if necessary.
What does this mean?
This simply means calling in someone who is able to become an incident commander and giving
them information about how far the call has progressed and signing off as incident commander.
This is just to avoid burnout.
The next part deals with the thing that we saw in chapter 4 of my story.
The morning after.
So blameless postmortem is something that the industry has talked about for years.
John Allspaw from Etsy has written great stuff about it.
I won't spend a lot of time talking about blameless postmortems.
But the TL;DR of this is that postmortems need to be blameless.
You could blame someone for any incident.
But at the end of the day, we are all human beings.
We are all more or less equally likely to make mistakes.
Be it someone who is a C-level exec, a CTO, or someone who just started at the company.
A new hire, an intern.
We're all likely to make mistakes and impact our company.
Blaming people for things that go wrong with complex systems is kind of pointless.
Remember something that great companies know: you really can't fire your way to reliability.
So firing people for causing a major incident, or having a negative business impact, is not the way to go.
So there's, like, a few gotchas about the role of incident commander.
One of the most common things that we get is: Who can be an incident commander?
Does it have to be someone really senior in the company?
The answer is no.
Anyone can be an incident commander.
Anyone who is able to communicate well, knows the systems well enough, and is confident enough that they can deal with the chaos can become an incident commander.
And to kind of make sure that you are comfortable with being an incident commander, the three-step
mantra: Define, prepare, and measure -- comes into play here.
So if you want to be an incident commander, for example, or if I as an intern had wanted to become an incident commander, the first thing is: Be prepared for it beforehand.
Run these chaos exercises, these game days, these chaos experiments, before the failure.
So you have enough experience to deal with an actual failure.
It does not really have to be someone really senior in the company.
And these are the lines from a major incident call.
Names have been redacted.
For example, you join an incident call as an SME.
And you have been trained.
Suddenly you realize that you're one of probably only four or five SMEs on the call.
The first thing you do is you ask: Is there an incident commander on the call?
If you don't hear anything back, if there are just crickets, you just say it out loud.
This is Aish, and I'm the incident commander for this call.
Well, you just say your name out.
I'm saying mine.
You can say "I'm Aish, and I'm the incident commander", but that's probably not gonna work out that way.
The next is wartime versus peacetime expectations.
So a lot of times we don't get paged.
Things don't fail.
Most developers don't go and run sudo rm -rf in production, or drop tables in production without taking a backup.
That's one of the things.
So what's an expectation for an SME, for you or me, or for someone who is an engineer who
works in a software development team and goes on call for things that they build?
Once in a while, things fail, but that's fine.
What's the expectation then?
What's the peacetime expectation?
The peacetime expectation is that you are just prepared to deal with failure.
That's it.
You can go have a life.
There's nothing about being on call that you should be worried about.
And the wartime expectation is to follow the guidelines and to stick with it.
Not to just go and join an incident call to learn how things work.
You can do that offline.
Now moving on to the last section, the review.
These are some key takeaways, first from the incident that we talked about, and then from the framework.
The first thing is: Shit happens.
Yeah.
Prepare for it.
Run simulations.
Prepare for it.
Train your people.
Make sure that your systems can deal with failures.
And things can go wrong all the time.
Just make sure that your company -- everyone, including the business, as well as the technology
part of it -- knows it well enough.
Develop on-call empathy.
So you might have seen this Twitter hashtag, #hugops.
Things can go wrong for anyone at any time.
So it's important to have empathy for someone who just got paged, who's working on a problem.
Don't try to intimidate them.
Just make sure you're a good team player, you follow the rules, and you have some empathy.
If you're an incident commander and you're tired, feel free to step down.
There are no gold medals for on-call heroics.
No one has received something like the Victoria Cross for being an on-call hero.
Just make sure you don't become an on-call hero.
Don't try to burn yourself out to prove that you know how the system works and you can
deal with failures firsthand.
So before I leave you all today, here is the one slide I would keep if I had to condense this talk into a minute-long talk, or maybe a lightning talk.
The key takeaway for most companies and teams that build software is that people are the most valuable asset.
Don't burn out your people by having them do something that can be automated.
Be it cleaning up logs, restarting a server, or just putting in a cron job.
These are things that can be done automatically.
You don't need to trigger an incident call and have ten people on there doing something that trivial.
People are your most valuable assets.
And thank you.
That's it.
(applause)
>> Nice work!
We have lots of questions.
That caused lots of discussion and debate, as anyone looking on Slack will know.
Firstly...
Everybody on Slack agrees that Mirabai should be the scribe on all calls in all situations.
That was an easy one to start with.
There's a question about where and when self-healing systems can be used.
Could they be used at nighttime, if an incident in that area happened?
Can code roll back?
How much value is there in self-healing systems?
>> So like most answers, the answer is "it depends".
There's no one-size-fits-all solution.
So the generic approach has to be tweaked for your needs.
But more or less, it should work out, and...
Well, it depends.
When you ask how it depends...
I'd need more specifics to answer that question.
>> Follow-up question on Slack for later, then.
There was a question about how actually logistically you managed this process, in terms of things
like chatops.
Do you think that chatops work effectively when there's a need for a call around incidents?
>> Definitely, yes.
So if you...
I can give you an example from my current employer, PagerDuty.
So we have a chatops command to start a major incident call.
It's an in-house thing.
It's not Open Source yet.
But if you feel like a minor incident is getting escalated into a major incident, we have a
chatops command that automates the process of calling in the incident commander, the
deputy commander, the customer liaison, the scribe.
All these people get paged about these things.
So chatops definitely helps.
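As an illustration only, here's a minimal sketch of what such a chatops trigger might boil down to; since the real command isn't open source, the role list and the page() helper are assumptions, not its actual behavior.

```python
# Hedged sketch of a chatops-style trigger for major incident response.
# The internal command described above is not open source, so the role list
# and the page() helper here are illustrative assumptions, not its behavior.
RESPONSE_ROLES = ["incident commander", "deputy", "customer liaison", "scribe"]


def page(role: str, summary: str) -> None:
    """Placeholder: call your paging / incident-management API for this role."""
    print(f"Paging {role}: {summary}")


def start_major_incident(summary: str) -> None:
    """What a '!major-incident <summary>' chat command might boil down to."""
    for role in RESPONSE_ROLES:
        page(role, summary)
    print("Major incident call started; scribe can begin the timeline.")


if __name__ == "__main__":
    start_major_incident("Checkout success rate below the agreed threshold")
```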
>> One of the questions is around the scale of company that this is appropriate for.
I'll just pull up a couple of examples.
So in situations where you're working in a really small team, where there might actually be fewer people than the number of roles, how does that work?
And don't you quickly end up risking getting in a situation where you've got 50% of your
whole company on the call and actually who's out there doing the work?
What sized team do you need to make this appropriate, do you think?
>> That's a great question and it's something that I get quite often.
So the most critical role -- and if you're a small company -- is to have the incident
commander.
Apart from the subject matter experts, who are the meat of the response, the people working towards solving the problem.
You'll need an incident commander and you'll need a scribe.
These are the bare minimum.
And the customer liaison.
You'll need three people on the call, apart from the people who are actually working on
getting to the solutions part.
So if you're a 10-person team, I'd still recommend that you have three people there.
And this means that apart from the customer support person, the customer liaison, who
is not from the engineering org, who is from the customer support org, you have two engineers
on call.
Which is something that a decent 10-person company should be able to do, I guess.
>> In that situation, are you just removing some of the roles, or are you seeing the three
people on the call merging some of those roles?
>> You end up merging all of these roles.
Which translates to the fact that the scribe would have to act as the deputy incident commander.
And some other roles would be mushed together.
The customer liaison would also help the incident commander.
So there is definitely some overlap of responsibilities that will happen.
But that's definitely better than not having a structure at all.
>> Okay.
Final question, then.
What tools do you use in practice to implement these roles and procedures?
Are there parts of it that are automatable, that you can recommend?
Are there any specifics that you recommend we all go and look at?
>> Since you asked the question, a shameless plug.
We use PagerDuty.
But apart from PagerDuty, we use some great monitoring tools, which help us get the data.
And we use a bunch of Open Source tools.
Like the internal chat plugin that we wrote.
Which was just built on a bunch of Open Source chatbot commands.
So chatops, good monitoring, and PagerDuty for incident management.
>> Great.
Thank you very much!