Cristan Massey considers how we can use chaos engineering to test our service processes.
Why do monkeys love bananas? Because they have appeal.
Okay, let’s move on from jokes and delve into the fascinating world of chaos engineering. In this blog post, we will explore the history and value of chaos engineering, and its synergies with IT service management (ITSM).
Chaos engineering is the practice of conducting experiments on a system to cultivate confidence in its ability to withstand turbulent and unexpected conditions. By continually running these experiments, we can bolster the resilience and reliability of our systems. As Dr. Kolton Andrus, Co-Founder and CEO of Gremlin, aptly describes it:
“Chaos Engineering increases the resilience and reliability of our systems.”
In today’s fast-paced world, organisations of all kinds require resilient and reliable services. Rather than reacting to unforeseen challenges, chaos engineering encourages proactive behaviours. It allows us to test challenging scenarios before they impact our businesses.
By investing in resilience and reliability, we can ensure uninterrupted operations and exceptional customer experiences.
A brief history
A team of engineers working at Netflix embarked on an innovative experiment: intentionally introducing chaos and failure into their own systems. Their objective was to test the resilience of those systems and identify their weaknesses. This ground-breaking approach led to the creation of Chaos Monkey, a tool that autonomously shut down instances of Netflix’s services in production, simulating failures.
“The Chaos Monkey experiment showed that by deliberately introducing controlled chaos, we could uncover hidden failures, isolate them, and fix them before they caused major outages or customer impacts.”
Recognising the value of chaos engineering, Netflix open-sourced Chaos Monkey in 2012 and continued to expand its arsenal of tools. Today, chaos engineering has evolved into a discipline in its own right, with multiple tools, processes, and organisations embracing its concepts and principles.
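To make the Chaos Monkey idea a little more concrete, here is a minimal sketch in Python. It is not Netflix’s implementation and it calls no real cloud API: the `Instance` class, the `web-0` style names, and the `min_healthy` threshold are all assumptions made up for illustration. The sketch simply picks a healthy instance at random, marks it as failed, and then checks whether the service would still count as available.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A hypothetical service instance; not a real cloud API object."""
    name: str
    healthy: bool = True


def terminate_random_instance(instances, rng):
    """Mark one randomly chosen healthy instance as failed and return it."""
    candidates = [i for i in instances if i.healthy]
    if not candidates:
        return None  # nothing left to terminate
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def service_available(instances, min_healthy=1):
    """Steady-state check: the service is up while enough instances are healthy."""
    return sum(i.healthy for i in instances) >= min_healthy


def run_experiment(instances, rounds, rng, min_healthy=1):
    """Terminate one instance per round and record whether the service stayed up."""
    results = []
    for _ in range(rounds):
        victim = terminate_random_instance(instances, rng)
        results.append((victim.name if victim else None,
                        service_available(instances, min_healthy)))
    return results


if __name__ == "__main__":
    fleet = [Instance(f"web-{n}") for n in range(3)]
    for victim, still_up in run_experiment(fleet, rounds=3, rng=random.Random(42)):
        print(f"terminated={victim} service_available={still_up}")
```

In a real experiment the termination step would act on actual infrastructure and the availability check would come from monitoring, but the shape is the same: define what "healthy" means, inject a failure, and observe whether the steady state holds.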
Even within the realm of ITSM, chaos engineering has found its place in ensuring the development of more resilient, reliable, and customer-centric systems. In my recent roles, I have witnessed first-hand the transformative impact of combining chaos engineering with ITSM.
ITSM is vital for maintaining stable IT operations and aligning services with business needs. However, without integrating chaos engineering, organisations miss proactive testing opportunities. Without intentional experimentation, hidden failures and weaknesses can go unnoticed, leading to major incidents and customer dissatisfaction. The absence of chaos engineering hinders continuous improvement and limits system resilience.
Combining chaos engineering with ITSM brings proactive measures to identify vulnerabilities, improve incident response, minimise business impact, and foster a culture of improvement. This synergy enables organisations to deliver reliable, customer-centric services and stay ahead in our modern digital landscape.
My own journey
In my own professional journey, I had the opportunity to be part of a team that facilitated chaos days – an integral component of our chaos engineering practice. Our responsibilities revolved around planning, preparing, and executing these interactive sessions. To ensure the success of chaos days, we assembled a group of individuals known, aptly enough, as the chaos monkeys. These individuals typically included development leads and those with a deep technical understanding of the systems in question.
The chaos monkeys’ role was to create controlled chaos by designing and implementing various scenarios that would intentionally disrupt the development environment. Working closely with the service managers, who acted as facilitators during these scenarios, the chaos monkeys orchestrated incidents that mirrored real-world major incidents. By using the existing major incident process, we not only matured that process but also gained a better understanding of how well we worked under pressure.
Once the engineers swung into action and resolved each scenario, a retrospective and post-incident review took place. During this critical phase, the ITSM team collaborated closely with the development teams to identify areas for improvement. These insights and recommendations were transformed into tangible chaos actions, which the team documented and managed through to completion.
The chaos actions became a vital measure of progress and served as key performance indicators for our team. We reported on them regularly, leveraging the data to drive continuous improvement and track our success in strengthening system resilience and reliability.
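As an illustration of how chaos actions might be reported on, here is a small Python sketch. The `ChaosAction` fields and the two measures (completion rate and mean days to close) are assumptions chosen for the example, not the exact KPIs our team used.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ChaosAction:
    """One improvement action from a post-incident review (hypothetical fields)."""
    title: str
    raised: date
    closed: Optional[date] = None  # None while the action is still open


def completion_rate(actions):
    """Share of actions that have been closed, from 0.0 to 1.0."""
    if not actions:
        return 0.0
    return sum(a.closed is not None for a in actions) / len(actions)


def mean_days_to_close(actions):
    """Average days from raised to closed, counting closed actions only."""
    durations = [(a.closed - a.raised).days for a in actions if a.closed is not None]
    return sum(durations) / len(durations) if durations else None


if __name__ == "__main__":
    actions = [
        ChaosAction("Add alert for queue backlog", date(2024, 1, 1), date(2024, 1, 11)),
        ChaosAction("Update major incident runbook", date(2024, 1, 1), date(2024, 1, 21)),
        ChaosAction("Document failover steps", date(2024, 1, 5)),
    ]
    print(f"completion rate: {completion_rate(actions):.0%}")
    print(f"mean days to close: {mean_days_to_close(actions)}")
```

Measures like these are only a starting point; the useful part is agreeing on them with the teams involved and reviewing them after each chaos day.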
Through this journey, I witnessed the transformative power of chaos engineering and its ability to foster collaboration between different teams. By creating a safe environment for controlled chaos and enabling cross-functional cooperation, we could identify and address vulnerabilities, streamline processes, and build stronger relationships across departments.
Incorporating chaos engineering principles into our ITSM practices allowed us to go beyond traditional testing and monitoring approaches. It empowered us to proactively challenge our systems, learn from failures, and constantly improve our ways of working.
The continuous feedback loop established through chaos days and the subsequent actions generated a culture of innovation and adaptability within our organisation.
When teams work together to conduct experiments, identify weaknesses, and implement improvements, the relationships between them grow stronger. This collaboration fosters a culture of continuous improvement and drives innovation across the organisation.
The value of chaos
As a service management professional, embracing chaos engineering principles can bring a multitude of benefits:
- Increased system reliability – Chaos engineering empowers ITSM professionals to proactively identify and address vulnerabilities in system architecture. By intentionally introducing controlled chaos and conducting experiments, you can uncover hidden failures and weaknesses in the system. This proactive approach allows for timely remediation, leading to increased system reliability and minimising the risk of unexpected disruptions.
- Enhanced incident response – With chaos engineering, ITSM professionals can simulate realistic outage scenarios and test the incident response capabilities of different teams and stakeholders. By running these simulation exercises, you can fine-tune incident management processes, strengthen coordination among teams, and optimise communication channels. This leads to more efficient incident response and reduced downtime during critical, real-time situations.
- Reduced business impact – By running chaos experiments and addressing weaknesses beforehand, ITSM professionals can minimise the impact of potential failures on the business. Uncovering vulnerabilities in advance allows proactive measures to be taken, preventing major outages or customer impacts. This reduction in business impact translates to improved operational continuity, customer satisfaction, and overall business performance.
- Improved stakeholder confidence – Chaos engineering builds confidence in stakeholders, including business leaders, customers, and end users. By actively testing and ensuring the resilience of systems, ITSM professionals can demonstrate their commitment to delivering reliable services. Stakeholders gain assurance that potential failures have been identified and addressed, fostering trust in the organisation’s ability to provide uninterrupted services.
- Stronger business relationships – Chaos engineering provides opportunities to collaborate closely with development, customer operations, and other teams involved in the system’s life cycle.
By embracing chaos engineering principles, you become a catalyst for positive change. You can actively contribute to building reliable, resilient, and customer-centric systems.
Conclusion
Embracing chaos engineering can unlock a world of possibilities for enhancing system reliability, improving incident response, reducing business impact, building stakeholder confidence, and fostering strong business relationships. However, it’s important to remember that there is no one-size-fits-all approach. Each organisation is unique, and you must tailor your implementation of chaos engineering principles to align with your specific business needs, your goals and, in some cases, your limitations.
One of the keys to success is making your work fun or, at the very least, not mundane. Chaos engineering introduces an element of excitement and curiosity as you navigate uncharted territory. Embrace the opportunity to learn, experiment, and grow.
Collaboration between teams is crucial when integrating chaos engineering into ITSM processes. Break down silos and foster a culture of collaboration, where different teams can come together to conduct experiments, share knowledge, and drive collective improvements.
Lastly, to fully embrace chaos engineering, you must be comfortable with being uncomfortable. This practice challenges traditional norms and encourages you to step outside your comfort zone. Embrace the unknown, push boundaries, and be willing to face the uncertainties that come with testing and experimenting.
As you embark on your chaos engineering journey, keep in mind that the path to success lies in tailoring your approach, infusing joy into your work, fostering collaboration, and embracing the discomfort of the unknown. By doing so, you will not only enhance the resilience and reliability of your systems but also transform your role as an ITSM professional and drive sustainable value for your organisation.