Complex Systems Break – The Changing Approach to Incident Management

The technology world is changing rapidly. If you stop and think about it for a moment, the level of technology we now hold in our hands through smartphones is incredible. If we go back to when my IT career started, computing power like that took 12-14 weeks to be delivered and needed a dedicated room to house it.

But what has also changed is the simplicity with which we interface with such technologies. Apps are a convenient and easy way to package the range of actions we require for news, games, transport and so on, and the user experience is key to an app being successful.

However, as IT professionals we’re all aware of the inverse complexity that lies behind this simplified and intuitive interface. We understand that delivery consists of many complex systems working together. And we know there are often generations of legacy systems that support some aspect or other.

How are Approaches Changing and What are the Drivers?

Traditional approaches to IT service management (ITSM) often allude, if not overtly, to a nirvana of zero-outage systems. Vendors sold hardware with an increasing number of 9s, whilst service management looked to problem management as a holy grail. The idea that the aim was to be more efficient rather than to do more was often the prevailing mantra.

However, in recent times there has been an acceptance that forces outside of ITSM’s control affect incident management. Business agility is critical for successful companies, and this has led to solutions and services becoming more complex. That greater complexity has in turn led many organisations to change their approach.

Instead of focusing attention on simply reducing downtime through greater investment, smarter organisations are looking at faster responses to incidents. They start from a position that accepts outages will happen and that swift resolution of issues is what’s needed. As a result, they aim to avoid suffering significant downtime and unwelcome press coverage.

I’ve managed to get this far without mentioning Digital Transformation, but it’s critical to this. The demands placed upon IT systems to cope with rapid business changes mean that aspiring to never fail is an unrealistic goal. Agile and DevOps practices mean that smaller, more iterative changes are constantly happening. These changes start to drive the conversation from “never break” to “fix fast”. The new mantra of speed rather than perfection and the principle of fail fast both point towards a new approach. But how does this new approach work and what does it look like?

What Should You Do?

Figure Out What’s Important

Traditional approaches to incident management call for across-the-board consistency in capturing everything as a ticket. However, many organisations are seeing this as excessive, with little true analytics being used on the majority of tickets. Some now only ticket major incidents, with all other incidents being managed through other channels, as many other parts of the organisation have been doing for years. Similarly, make sure that when you’re dealing with major issues you communicate effectively and distinctly to ensure that participants understand the importance. You have to work out what’s critical for your business and work from there.

Look at New Tools and Techniques

As teams are increasingly global there is a need to connect them using tools. This is never more important than when dealing with incidents. Increasingly, digital incident responders use tools like HipChat and Slack, enhanced with direct operational integration via ChatOps plug-ins like Hubot. This allows for the simple raising and updating of tickets via chat commands, saving the toil of manual updates. But it’s not just tools that are innovative. Incident management techniques like swarming are improving response times and quality. Rather than relying on tiered responses to incidents, swarming allows self-forming teams of various backgrounds to work on tickets immediately. The Consortium for Service Innovation (https://www.serviceinnovation.org/intelligent-swarming/) has a number of case studies highlighting the faster resolution times and multiple teamwork benefits of using this approach.
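To make the ChatOps idea above a little more concrete, here is a minimal sketch in Python of how chat commands might raise and update tickets. Everything in it is an assumption for illustration only: the `!incident` command syntax, the `TicketBot` and `Ticket` names, and the in-memory ticket store are hypothetical, and none of it reflects the actual Hubot, Slack or HipChat APIs. A real integration would sit behind the chat platform's own bot framework and call your ticketing system instead of a dictionary.

```python
import re
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Ticket:
    ticket_id: int
    summary: str
    updates: list = field(default_factory=list)


class TicketBot:
    """Hypothetical in-memory stand-in for a ticketing system behind a chat bot."""

    # Assumed command syntax, invented for this sketch.
    RAISE = re.compile(r"^!incident raise (?P<summary>.+)$")
    UPDATE = re.compile(r"^!incident update (?P<id>\d+) (?P<note>.+)$")

    def __init__(self):
        self.tickets = {}
        self._ids = count(1)

    def handle(self, user, message):
        """Return the bot's reply, or None if the message is ordinary chat."""
        text = message.strip()

        m = self.RAISE.match(text)
        if m:
            ticket = Ticket(next(self._ids), m["summary"])
            self.tickets[ticket.ticket_id] = ticket
            return f"Raised ticket #{ticket.ticket_id}: {ticket.summary}"

        m = self.UPDATE.match(text)
        if m:
            ticket = self.tickets.get(int(m["id"]))
            if ticket is None:
                return f"No ticket #{m['id']} found"
            # Record who said what, so the ticket history mirrors the chat.
            ticket.updates.append(f"{user}: {m['note']}")
            return f"Updated ticket #{ticket.ticket_id}"

        return None


if __name__ == "__main__":
    bot = TicketBot()
    print(bot.handle("alice", "!incident raise Checkout API returning 500s"))
    print(bot.handle("bob", "!incident update 1 Rolled back latest deploy"))
```

The point of the sketch is that the ticket history is built as a side effect of the conversation responders are already having, which is where the saving in manual updates comes from.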

Use Post-mortems Properly

Post-mortems are the most effective way to understand what happens with incidents. However, to make them truly effective they have to be blameless and avoid holding someone accountable. Experts in human factors and safety, like Sidney Dekker in his book The Field Guide to Understanding Human Error, emphasise the need to examine the system that leads to a given individual’s actions. Human error has little or no place in safety science thinking, which focuses on the context giving rise to the error. You need to think about what made an engineer think something was the right thing to do at that time. There is also little point in looking for a single root cause, as in complex system failures there are invariably multiple causal factors.

Increasing complexity has led to a realisation that existing approaches to incident management need to be improved. Although avoidance of incidents is still key (containerisation, infrastructure as code, DevOps all contribute to this), better response to incidents is increasingly important. However, without understanding the human factor and avoiding the blame game, your progress will be limited at best.

If this article has piqued your interest then please join me at itSMF UK’s Annual Conference (ITSM18) where this will be the focus of my presentation. You can see the full agenda here.

Duncan Watkins

Duncan is a senior consultant at Forrester Research specialising in business technology. He provides research-based consulting services that help technology professionals leverage Forrester's proprietary research and expertise to meet the ever-changing needs and expectations of their stakeholders.