The Post-Incident Review - IT Revolution#

Excerpt#

Post-incident reviews are a critical feedback loop that helps organizations improve the state of their systems and respond more effectively.


Once an incident is resolved, there is a tendency to move on and go back to normal daily work. This is a missed opportunity to gather critical learnings and understand true system behavior as well as process and system breakdowns.

There should be two types of post-incident review: local and global.

Local Post-Incident Review#

  • Reviews the timeline.
  • Identifies and discusses what went wrong.
  • Discusses what went right.

Some of the most important questions to ask are:

  • How could we have detected this sooner? Did we have the right triggers?
  • How could we have diagnosed the incident more rapidly? Did responders have the information they needed to diagnose the issue?
  • What would have helped resolve this faster? Do we require new triggers, data collection, tools, or processes?
  • What specific actions should we take to improve?
  • Where did we get lucky?
  • What did we learn about how our system behaves?
  • How could we have prevented the incident from occurring?
  • What went well in handling this incident?

Record and Take Action#

Immediate tactical fixes are important and should be identified so that systems can be stabilized as quickly as possible. Longer-term, broad-based improvements should be discussed as well, to find solutions that prevent incidents from recurring.

Global Post-Incident Review#

Local post-incident reviews generate significant learning about how local systems and processes behave, including the quality of the response. But when teams capture reviews in a siloed way, other teams and the wider organization don’t get access to all the lessons learned.

Break Down Silos#

  • Hold a Global Incident Review if a major incident has occurred.
  • During the Global Incident Review workshop, teams and stakeholders should assess the impact on the business first, and then the impact on the technology stack.
  • Tell the story of the incident to provide the best possible context and to drive the audience’s engagement.
  • Discuss remediation plans and follow-up improvement items.
  • Discuss what the organization and all teams (not just the team impacted) can learn from the event.
  • Identify improvements to how the incident was diagnosed, including identifying the impacted service, assigning the priority level, and engaging the correct resolver teams, in order to improve response time in the future.
  • Review the repair steps and identify recommendations to reduce repair duration in future incidents.
  • Review how long it took to initiate and complete each activity, and use those durations to identify improvement recommendations.
  • Assess whether incident communication was effective or whether anything can be improved to reduce delays, confusion, and lead time.
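
Reviewing how long each activity took is easier with the timeline in a structured form. The following is a minimal Python sketch, not a prescribed method. The event names (`started`, `detected`, `diagnosed`, `resolved`), the timestamps, and the `phase_durations` helper are all hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical timeline for a single incident. The event names and
# timestamps are illustrative assumptions, not data from a real incident.
timeline = {
    "started": datetime(2024, 1, 15, 9, 0),
    "detected": datetime(2024, 1, 15, 9, 20),
    "diagnosed": datetime(2024, 1, 15, 10, 5),
    "resolved": datetime(2024, 1, 15, 11, 0),
}


def phase_durations(events):
    """Return the minutes elapsed between consecutive timeline events."""
    # Sort by timestamp so the result does not depend on insertion order.
    ordered = sorted(events.items(), key=lambda item: item[1])
    durations = {}
    for (prev_name, prev_time), (name, time) in zip(ordered, ordered[1:]):
        minutes = (time - prev_time).total_seconds() / 60
        durations[f"{prev_name} -> {name}"] = minutes
    return durations


durations = phase_durations(timeline)

# The longest phase is a natural starting point for the discussion.
longest = max(durations, key=durations.get)

for phase, minutes in durations.items():
    print(f"{phase}: {minutes:.0f} min")
print(f"Longest phase: {longest}")
```

Comparing these durations across several incidents may show whether detection, diagnosis, or repair most often dominates the total time.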

Take Action#

After an incident is resolved, the organization and team must improve their ability to detect, diagnose, mitigate, resolve, and prevent future incidents.

Improvement Items#

As part of the post-incident review, look for contributing factors to the incident and try to identify specific and actionable opportunities for improvement.

Use the same tools and processes to track post-review improvement items as you use for daily work. For example, if your team uses Jira to track daily work, use Jira to track post-review improvement items in the same way.

Think Broadly#

  • How could we have detected the incident more easily?
  • How could we have diagnosed the incident more rapidly?
  • How could we have mitigated the effects of the incident on the customer experience?
  • How could we have resolved the incident more quickly?
  • How could we have prevented the incident from occurring?
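
One way to check that a review has thought broadly is to tag each improvement item with one of the five areas above and see which areas have no items yet. The sketch below assumes a hypothetical tagging scheme (`detect`, `diagnose`, `mitigate`, `resolve`, `prevent`) and made-up example items. In practice the items would live in whatever tracker the team already uses for daily work.

```python
from collections import defaultdict

# Hypothetical category labels mirroring the five questions above.
CATEGORIES = ("detect", "diagnose", "mitigate", "resolve", "prevent")


def group_by_category(items):
    """Group (category, description) pairs by category."""
    grouped = defaultdict(list)
    for category, description in items:
        if category not in CATEGORIES:
            raise ValueError(f"Unknown category: {category}")
        grouped[category].append(description)
    return dict(grouped)


def uncovered_categories(items):
    """Return the categories that have no improvement item yet."""
    grouped = group_by_category(items)
    return [category for category in CATEGORIES if category not in grouped]


# Illustrative improvement items, not taken from a real review.
items = [
    ("detect", "Add an alert on queue depth"),
    ("resolve", "Document the rollback procedure"),
]

gaps = uncovered_categories(items)
print("Areas with no improvement items yet:", ", ".join(gaps))
```

An empty area is not necessarily a problem, but it is a prompt to ask the corresponding question again before closing the review.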