The Post-Incident Review - IT Revolution#
Excerpt#
Post-incident reviews are a critical feedback loop that helps organizations improve the state of their systems and respond more effectively.
Once an incident is resolved, there is a tendency to move on and go back to normal daily work. This is a missed opportunity to gather critical learnings and understand true system behavior as well as process and system breakdowns.
There should be two types of post-incident review: local and global.
Local Post-Incident Review#
- Reviews the timeline.
- Identifies and discusses what went wrong.
- Discusses what went right.
Some of the most important questions to ask are:
- How could we have detected this sooner? Did we have the right triggers?
- How could we have diagnosed the incident more rapidly? Did responders have the information they needed to diagnose the issue?
- What would have helped resolve this faster? Do we require new triggers, data collection, tools, or processes?
- What specific actions should we take to improve?
- Where did we get lucky?
- What did we learn about how our system behaves?
- How could we have prevented the incident from occurring?
- What went well in handling this incident?
Record and Take Action#
Immediate tactical fixes are important and should be identified so that systems can be stabilized as fast as possible. Longer-term, broad-based improvements should be discussed as well, to identify ways to prevent incidents from recurring.
Global Post-Incident Review#
Local post-incident reviews generate significant learning about how local systems and processes behave, including the quality of the response. But when teams capture reviews in silos, the rest of the organization does not get access to the lessons learned.
Break Down Silos#
- Hold a Global Incident Review if a major incident has occurred.
- During the Global Incident Review workshop, teams and stakeholders should assess the impact on the business first and then the impact on the technology stack.
- Tell the story of the incident to provide the best possible context and to drive the audience’s engagement.
- Discuss remediation plans and follow-up improvement items.
- Discuss what the organization and all teams (not just the team impacted) can learn from the event.
- Identify improvements to how the incident was diagnosed, including identifying the impacted service, setting the priority level, and engaging the correct resolver teams, to improve response time in the future.
- Review the repair steps and identify recommendations to shorten repair time in future incidents.
- Review how long it took to initiate and complete each activity, and use those durations to identify improvement recommendations.
- Assess whether incident communication was effective or whether anything can be improved to reduce delays, confusion, and lead time.
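Reviewing how long each activity took is easier with the timeline in a structured form. The following is a minimal Python sketch, not a prescribed method: the event names (`started`, `detected`, `responders_engaged`, `mitigated`, `resolved`) and all timestamps are invented for illustration, and a real review would use whatever milestones the team actually records.

```python
from datetime import datetime, timedelta

# Hypothetical timeline for a single incident. The event names and
# timestamps are assumptions made up for this example.
timeline = {
    "started": datetime(2024, 1, 15, 9, 0),
    "detected": datetime(2024, 1, 15, 9, 12),
    "responders_engaged": datetime(2024, 1, 15, 9, 20),
    "mitigated": datetime(2024, 1, 15, 9, 50),
    "resolved": datetime(2024, 1, 15, 10, 30),
}


def phase_durations(events: dict[str, datetime]) -> dict[str, timedelta]:
    """Return the elapsed time between each pair of consecutive events."""
    ordered = sorted(events.items(), key=lambda item: item[1])
    return {
        f"{a} -> {b}": t_b - t_a
        for (a, t_a), (b, t_b) in zip(ordered, ordered[1:])
    }


durations = phase_durations(timeline)
for phase, elapsed in durations.items():
    print(f"{phase}: {elapsed}")

# The longest phase is a natural starting point for discussion.
longest_phase = max(durations, key=durations.get)
print(f"Longest phase: {longest_phase}")
```

With these sample timestamps, the longest phase is the one between mitigation and resolution, which would prompt the group to ask what slowed the final repair.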
Take Action#
After an incident is resolved, the organization and team must improve their ability to detect, diagnose, mitigate, resolve, and prevent future incidents.
Improvement Items#
As part of the post-incident review, look for contributing factors to the incident and try to identify specific and actionable opportunities for improvement.
Use the same tools and processes to track post-review improvement items as you use for daily work. For example, if your team uses Jira to track daily work, use Jira to track post-review improvement items in the same way.
Think Broadly#
- How could we have detected the incident more easily?
- How could we have diagnosed the incident more rapidly?
- How could we have mitigated the effects of the incident on the customer experience?
- How could we have resolved the incident more quickly?
- How could we have prevented the incident from occurring?
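The five questions above map onto five capabilities: detect, diagnose, mitigate, resolve, and prevent. One way to check that a review has thought broadly is to tag each improvement item with the capability it addresses and see which ones have no items yet. The sketch below assumes a hypothetical `ImprovementItem` record with invented titles and owners; in practice these items would live in the team's existing work tracker rather than in a script.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The five capabilities the review questions cover."""
    DETECT = "detect"
    DIAGNOSE = "diagnose"
    MITIGATE = "mitigate"
    RESOLVE = "resolve"
    PREVENT = "prevent"


@dataclass
class ImprovementItem:
    """Hypothetical record for one post-review improvement item."""
    title: str
    category: Category
    owner: str


def group_by_category(items):
    """Group improvement items by the capability they address."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.category].append(item)
    return grouped


def uncovered_categories(items):
    """Return the capabilities that have no improvement item yet."""
    covered = {item.category for item in items}
    return [c for c in Category if c not in covered]


# Invented example items, for illustration only.
items = [
    ImprovementItem("Alert on queue depth", Category.DETECT, "team-a"),
    ImprovementItem("Add request IDs to logs", Category.DIAGNOSE, "team-b"),
    ImprovementItem("Document rollback runbook", Category.RESOLVE, "team-a"),
]

grouped = group_by_category(items)
gaps = uncovered_categories(items)
print("No items yet for:", [c.value for c in gaps])
```

An empty category is not necessarily a problem, but it is a prompt to ask the matching question again before closing the review.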