Triage for Incident Response
One of the main pressures in incident response is simply being overwhelmed with tasks. So many demands and so much context-switching can easily end in chaos or in poor-quality quick fixes. As with all real-time response, the key is to take a step back and triage incoming requests as they arrive, prioritising those we need to deal with first and deferring those we can tackle later.
Red: Urgent Action Required
The service is down or degraded, users are being affected, and it needs to be looked at urgently. These are the errors that heavily affect users and page out-of-hours staff to respond.
Pink: Walking Wounded
The site isn’t working as well as it should and users aren’t getting the best possible experience, but any remediation can wait for normal office hours, with the tasks handed out to the appropriate teams.
Black: No Chance of Recovery
The service has no chance of recovery, perhaps due to a chronic application bug. There is no way to bring the service back to full health without substantial planned work and coordination.
White: Discharged, No Longer an Issue
An issue has resolved itself. A typical example would be a cloud hosting network problem or a third-party upstream service failure.
Blue: Improvement Task
Something that will improve the service, or our ability to sustain it, monitor it, or handle any issues with it. These tasks often fall into the background and never reach the top of the backlog. Unfortunately, these are exactly the tasks that free up time in the future and build more robust, sustainable services.
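The five categories above could be captured in code so that every incoming issue gets a colour before anyone starts work on it. The following is only a rough sketch: the `Issue` fields and the rules in `classify` are hypothetical, chosen to mirror the descriptions above, and a real team would tune them to the signals its own alerts carry.

```python
from dataclasses import dataclass
from enum import Enum


class Triage(Enum):
    RED = "Urgent Action Required"
    PINK = "Walking Wounded"
    BLACK = "No Chance of Recovery"
    WHITE = "Discharged, No Longer an Issue"
    BLUE = "Improvement Task"


@dataclass
class Issue:
    # Every field here is an illustrative assumption, not a fixed schema.
    summary: str
    service_down_or_degraded: bool = False
    users_affected: bool = False
    resolved_itself: bool = False
    needs_planned_rework: bool = False


def classify(issue: Issue) -> Triage:
    """Assign a triage colour using simple, assumed rules."""
    if issue.resolved_itself:
        return Triage.WHITE   # nothing left to do
    if issue.needs_planned_rework:
        return Triage.BLACK   # only substantial planned work will help
    if issue.service_down_or_degraded and issue.users_affected:
        return Triage.RED     # page someone now
    if issue.users_affected:
        return Triage.PINK    # fix within normal office hours
    return Triage.BLUE        # improvement work for the backlog


outage = Issue("Checkout returning 500s",
               service_down_or_degraded=True, users_affected=True)
print(classify(outage).name, "-", classify(outage).value)
```

Keeping the rules in one small function makes them easy to review and change as the team learns which signals really separate a Red from a Pink.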
Putting Users First
Categorising incidents by triage lets us avoid the reactive approach of dealing with issues in the order they arrive rather than by urgency. With triage we ensure that the most pressing issues are tackled first, and that our users are put first.
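To illustrate the difference between working in arrival order and working by urgency, here is a minimal sketch of a triage queue built on Python's standard `heapq` module. The `TriageQueue` class and the numeric ordering in `PRIORITY` are assumptions made for this example. The categories above only establish that Red comes first, so the relative order of the remaining colours is a judgment call.

```python
import heapq
from itertools import count

# Assumed ordering: a lower number is handled sooner.
PRIORITY = {"red": 0, "pink": 1, "black": 2, "blue": 3, "white": 4}


class TriageQueue:
    """Hands back issues by urgency first, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker that preserves arrival order

    def add(self, colour, summary):
        entry = (PRIORITY[colour], next(self._arrival), summary)
        heapq.heappush(self._heap, entry)

    def next_issue(self):
        _, _, summary = heapq.heappop(self._heap)
        return summary

    def __len__(self):
        return len(self._heap)


queue = TriageQueue()
queue.add("blue", "Add disk-usage alert")
queue.add("pink", "Slow search results")
queue.add("red", "Checkout returning 500s")

order = [queue.next_issue() for _ in range(len(queue))]
print(order)
```

Although the Red issue arrived last, it is the first one handed back. Issues of the same colour still come out in the order they arrived.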