Technology loves order. Software structures, and indeed hardware systems and microprocessor chipset architectures, work best when they are shaped to a defined order, when they align to a codified structure and when they conform to a stipulated syntax so that they slot into place without any scuffing around the edges. With EU regulations such as DORA and NIS2 approaching (and with US equivalents already in place and surely more to come), every business from banks to software providers is under pressure to bolster its service reliability and conform to agreed structures and best practices. Those that do not could face non-compliance fines.
In the face of new artificial intelligence offerings, there is a widespread consensus that organizations should be automating many of their core technology processes and setting up automatic failover policies so applications can quickly recover from a regional infrastructure outage. To maximize resilience, organizations must also be able to automatically roll back to the last working version of their software if something breaks.
But despite the need for ordered management and strictly enforced controls at every level of the software stack, there are instances where we can embrace some rather more chaotic approaches to technology. Martin Reynolds, field CTO at AI-driven continuous integration DevOps platform company Harness, says that one of the most impactful solutions could be found in an unlikely place: chaos engineering.
What Is Chaos Engineering?
Not quite as scary as it sounds, chaos engineering is a scientific method and software application development practice that involves the injection of (known and controlled) failures into an application codebase or its infrastructure stack.
Software engineers are then able to observe and validate the behavior of the system as it buckles and snaps (or not) and then tweak the software’s structure and execution so that real-world failures hopefully never end up manifesting themselves to users. As we would expect with any given scientific method, the process of chaos engineering centers on laying down hypotheses and conducting experiments, then comparing the results to a control factor, which in this case is the “steady state” of the software system.
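To make the hypothesis-versus-steady-state idea concrete, here is a minimal sketch of a chaos experiment in Python. Everything in it is an illustrative assumption rather than any vendor's tooling: the service is simulated, and the names `call_service`, `call_with_retries` and `run_experiment`, the 30% injected failure rate and the two-point tolerance are all hypothetical. The hypothesis is that client-side retries keep the success rate close to the steady state even while failures are being injected.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_success_rate: float  # steady state, no failures injected
    chaos_success_rate: float     # measured while failures are injected
    hypothesis_held: bool


def call_service(failure_rate: float, rng: random.Random) -> bool:
    """Simulated service call (hypothetical): True means the call succeeded."""
    return rng.random() >= failure_rate


def call_with_retries(failure_rate: float, rng: random.Random, retries: int) -> bool:
    """The resilience mechanism under test: retry up to `retries` extra times."""
    return any(call_service(failure_rate, rng) for _ in range(retries + 1))


def success_rate(failure_rate: float, rng: random.Random, retries: int,
                 requests: int = 2000) -> float:
    ok = sum(call_with_retries(failure_rate, rng, retries) for _ in range(requests))
    return ok / requests


def run_experiment(injected_failure_rate: float, retries: int,
                   tolerance: float = 0.02, seed: int = 42) -> ExperimentResult:
    """Compare behavior under injected failure against the steady state."""
    rng = random.Random(seed)
    baseline = success_rate(0.0, rng, retries)                  # control run
    chaos = success_rate(injected_failure_rate, rng, retries)   # experiment run
    # Hypothesis: the success rate drops by no more than `tolerance`.
    return ExperimentResult(baseline, chaos, (baseline - chaos) <= tolerance)


if __name__ == "__main__":
    print(run_experiment(injected_failure_rate=0.3, retries=3))
    print(run_experiment(injected_failure_rate=0.3, retries=0))
```

In this toy setup the hypothesis should hold with three retries and fail with none, which is the kind of weak link a real experiment is meant to surface before users do.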
“Netflix popularized the idea of chaos engineering back in 2008, that is, the idea of intentionally introducing controlled failures into a system to see how it fares and identify any weaknesses or vulnerabilities,” explained Reynolds. “After a three-day-long outage caused by a major database corruption, Netflix moved from a monolithic homegrown architecture to a distributed cloud set-up via AWS. One of the first systems that engineers built was called the Chaos Monkey – designed to randomly kill instances and services within Netflix’s architecture and test its ability to maintain services when something goes wrong.”
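The random-termination idea Reynolds describes can be sketched in a few lines of Python. This is not Netflix's implementation; it is a toy, in-memory model with hypothetical names (`Cluster`, `unleash_monkey`) and made-up instance labels, intended only to show the pattern: pick victims at random, terminate them, then check whether the service still answers.

```python
import random


class Cluster:
    """Toy in-memory cluster; instance names are hypothetical."""

    def __init__(self, instances):
        self.healthy = set(instances)

    def terminate(self, instance):
        self.healthy.discard(instance)

    def handle_request(self) -> bool:
        """The service stays available while at least one instance is healthy."""
        return len(self.healthy) > 0


def unleash_monkey(cluster: Cluster, kills: int, rng: random.Random):
    """Terminate up to `kills` randomly chosen healthy instances."""
    candidates = sorted(cluster.healthy)  # sorted so a seeded run is repeatable
    victims = rng.sample(candidates, min(kills, len(candidates)))
    for victim in victims:
        cluster.terminate(victim)
    return victims


if __name__ == "__main__":
    cluster = Cluster(["web-1", "web-2", "web-3"])
    killed = unleash_monkey(cluster, kills=2, rng=random.Random(0))
    print("terminated:", killed, "still serving:", cluster.handle_request())
```

A real tool would act on actual infrastructure and would normally be limited to an agreed blast radius; the sketch only captures the random selection and the follow-up availability check.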
This insight into how systems behave under defined failure scenarios is said to help teams understand the weak links in their applications and infrastructure and fix them proactively. As a result, they can minimize the chances that unexpected problems render business-critical services unavailable to end users.
“At its best, chaos engineering can reduce costly downtime and speed up incident reporting time. Despite this, since its conception, developers have feared that things would break if services were shut down as part of the chaos engineering process itself,” said Harness’ Reynolds. “As a result, the approach has never seen wide adoption amongst most development teams. All too often, a tiny proportion of developers are assigned to chaos engineering – too few to have a sizable impact on the business. What’s more, chaos engineering is often applied far too late in the delivery process, limiting the value it can bring.”
To reap the benefits of this approach, enterprise organizations probably need to break this cycle of fear and misunderstanding and think about how recent advances in automation, AI and the rise of internal developer portals can help to enhance the approach.
Give Chaos A Chance
As advancements in AI enable smaller teams to automate more processes, chaos engineering could become far more impactful. For example, suggests Reynolds, AI can automate service discovery, giving engineers a complete picture of the applications and infrastructure underpinning their services. This extensive coverage will enable chaos engineers to automate 80% of the basic tests needed to gauge the reliability of their underlying infrastructure and code, such as application availability and response time.
“With the more common issues identified automatically, engineers can divert their efforts onto higher value chaos activities,” said Reynolds. “They have more time available to experiment with application specific use cases, trying out different (and sometimes unusual) scenarios to identify and solve more unexpected issues. For example, testing to see if an application can handle a 100x surge in user logins at the same time, or if an application will process multiple or erroneous entries to a database field.”
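The login-surge scenario in that quote could be explored with a sketch like the following. It is an assumption-laden illustration, not a Harness feature: `LoginService` is a hypothetical handler that sheds load once a fixed number of logins are in flight, and `surge_test` simply fires 100 times the baseline number of logins through a thread pool and tallies the outcomes.

```python
import threading
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class LoginService:
    """Hypothetical login handler that rejects requests beyond a fixed capacity."""

    def __init__(self, capacity: int):
        self._slots = threading.BoundedSemaphore(capacity)

    def login(self, user_id: int) -> str:
        # Shed load immediately instead of queueing when every slot is busy.
        if not self._slots.acquire(blocking=False):
            return "rejected"
        try:
            time.sleep(0.001)  # stand-in for credential checks
            return "ok"
        finally:
            self._slots.release()


def surge_test(service: LoginService, baseline_users: int,
               multiplier: int = 100, workers: int = 50):
    """Fire baseline_users * multiplier logins concurrently and tally outcomes."""
    total = baseline_users * multiplier
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = Counter(pool.map(service.login, range(total)))
    return total, outcomes


if __name__ == "__main__":
    total, outcomes = surge_test(LoginService(capacity=10), baseline_users=5)
    print(total, dict(outcomes))
```

The exact split between accepted and rejected logins depends on thread timing, so the useful check is that every request receives a definite answer and nothing raises or hangs under the surge.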
AI can also act as an assistant to help developers spin up chaos experiments more easily. For example, the Harness team says it has overseen work on a generative AI-powered natural language interface that can help developers cut down the time and cognitive burden required to write experiments. The proposition here is that enterprises can scale these practices even further across their organization by auto-populating the chaos scenarios they build into their internal developer portal (IDP). This brings chaos testing into the mainstream, as other development teams across the organization can more easily reuse previous scenarios, helping them to evaluate the resilience of new features as they are introduced.
From Chaos To Resilience
The potential for chaos engineering has grown. As enterprises face downtime costs, growing regulatory pressure and increasingly complex environments, they have an opportunity to harness chaos and achieve what Harness calls “continuous resilience” today.
“By integrating chaos engineering practices as part of their central software delivery platform, organizations can build resilience into every aspect of the software application development lifecycle, from continuous integration and continuous deployment to GitOps, and observability workflows. This allows for more comprehensive resilience testing. From here, enterprises can use AI to supercharge chaos engineering and gain a huge head start when building reliable apps and services,” concluded Reynolds.
Not quite as esoteric as it may initially appear, chaos engineering is well known among the software engineering and data science cognoscenti, and it enjoys a respected status among those technicians who appreciate its perhaps slightly otherworldly perception by the business community. Under control, chaos engineering can be a fundamentally positive force. All we need to do is make sure the right person is in charge to captain this effort… now who could that be?