If you are an IT professional supporting major production environments and applications, you have most likely experienced a significant system outage at some point. We had one of those events this week. As with previous experiences in other organizations, I saw people at their best come together as a team, working diligently to restore systems. This team included IT, clinical, and operations staff.
I know CIO colleagues who recently managed through a week-long outage of their business systems in one case, and a multi-day outage of their electronic health record in another. They could probably share similar lessons following those experiences.
In the spirit of teaching and learning from one another, I offer these key points for handling a significant event:
- Quickly bring in all required parties to diagnose and resolve the problem
- Quickly assess the impact and engage the right operational leaders
- As you understand the scope and extent of the situation, bring in additional expertise
- Determine the remediation steps needed from a systems and operational perspective
- Establish a communication plan with regular updates to the user community; provide very specific and clear information that staff need in order to do their work during the incident
- If the incident lasts longer than expected, ensure that key staff have backups so they are able to get some rest and do formal handoffs for continuity
- Assemble the right people to monitor the situation and jointly make recovery decisions, taking all operational impacts into account
- As the CIO, own it and provide updates to executives with full transparency
- After the incident, do a full Root Cause Analysis (RCA) of what happened and debrief on how the incident was handled; identify action items, assign an owner to each, establish due dates, and track progress
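For teams that track RCA follow-up in a script rather than a spreadsheet, the last point above can be sketched roughly as follows. This is a minimal illustration, not the tooling we used; the `ActionItem` fields and the `overdue` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    """One RCA follow-up: what to do, who owns it, and when it is due."""
    description: str
    owner: str          # the single person responsible
    due: date
    done: bool = False


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Return open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


# Illustrative data only; these items and owners are made up.
items = [
    ActionItem("Document failover procedure", "owner-a", date(2024, 1, 10)),
    ActionItem("Review alerting thresholds", "owner-b", date(2024, 2, 1)),
    ActionItem("Update downtime forms", "owner-c", date(2024, 1, 5), done=True),
]

for item in overdue(items, today=date(2024, 1, 15)):
    print(f"OVERDUE: {item.description} (owner: {item.owner}, due {item.due})")
```

Reviewing the overdue list at a regular leadership meeting is one simple way to keep the "track progress" part from slipping once the incident is behind you.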
When I described the incident we had this week to my husband, he could relate. Before he became a minister, he was the data center manager for a national benefits consulting firm. He recalled major outages they experienced. But he was quick to put them in perspective, saying they were just moving money around, not running a hospital and taking care of patients.
Health care is different. We are dealing with people’s lives. Patient safety is the top priority. For each checkpoint call during the incident, we followed the same framework: which systems are available, which systems are unavailable, what patient safety concerns we need to address, and a roll call of all hospitals for any specific issues.
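If you want every checkpoint call recorded in the same shape, the framework above could be captured in a small structure like the one below. It is only a sketch under my own assumptions: the `Checkpoint` class, its field names, and the sample system and hospital names are hypothetical, not a description of any particular tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Checkpoint:
    """Notes from one checkpoint call, following the four-part framework."""
    available: List[str]
    unavailable: List[str]
    safety_concerns: List[str]
    # Roll call: hospital name -> issues reported (empty list if none)
    hospital_issues: Dict[str, List[str]] = field(default_factory=dict)

    def hospitals_reporting_issues(self) -> List[str]:
        """Names of hospitals that raised at least one issue, sorted."""
        return sorted(h for h, issues in self.hospital_issues.items() if issues)

    def summary(self) -> str:
        """Plain-text summary suitable for a status update."""
        lines = [
            "Available: " + (", ".join(self.available) or "none"),
            "Unavailable: " + (", ".join(self.unavailable) or "none"),
            "Patient safety concerns: "
            + (", ".join(self.safety_concerns) or "none"),
        ]
        for hospital in sorted(self.hospital_issues):
            issues = self.hospital_issues[hospital]
            lines.append(f"{hospital}: " + ("; ".join(issues) or "no issues reported"))
        return "\n".join(lines)


# Illustrative values only.
call = Checkpoint(
    available=["Lab system"],
    unavailable=["Electronic health record"],
    safety_concerns=["Medication orders on paper"],
    hospital_issues={"Hospital B": [], "Hospital A": ["Printer queue backed up"]},
)
print(call.summary())
```

Keeping the same four fields on every call makes it easier to compare one checkpoint with the next and to see whether the situation is improving.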
In the command center, we discussed the operational impacts in detail with clinical leaders. Listening to their questions and concerns about basic workflow and patient care reminds us that what we do matters. As IT professionals, we don’t touch the patient. But we are a critical part of the extended patient care team, because clinicians and caregivers depend on the systems we provide and support 24/7.
Our debriefing and RCA from this incident have already begun. There will be opportunities to improve. These will include steps to prevent a similar incident in the future and to improve how we manage such incidents to ensure seamless operations and continuity of care.
As IT professionals, we work hard to ensure outages don’t happen. But if they do, we come together as a team and deal with them. I’m grateful to our very talented and committed team and honored to be working with them.