What Is Chaos Engineering And How Does It Work?

Written by Allen Victor | Jun 28, 2022 11:00:00 AM

Software systems at large-scale businesses often run near full capacity while handling complex computations across distributed architectures. In such systems, failures are likely and their causes are often hard to trace. Chaos engineering is finding widespread application as a reliable way to prepare for such failures in advance.

Predicting system failures is something of a paradox, since there is no way to pinpoint in advance what will trigger one. Chaos engineering, with its extended phases of chaos testing, helps build more resilient software systems that can withstand complexity and heavy computational loads.

In this article, we take a deep dive into the fundamentals of chaos engineering and how it applies to the development of robust software systems. We will also look into how chaos engineering can help alleviate the uncertain and volatile conditions that arise during production.

What Is Chaos Engineering?

Chaos engineering is fundamentally the practice of deliberately introducing failure into a software system in order to build its resilience with each iteration.

Although it sounds like a haphazard process, running such failure-injection experiments takes a lot of discipline and a data-driven approach.

This practice is primarily applied by Site Reliability Engineering (SRE) teams in a bid to maximize the system uptime for a seamless experience for the end-users. Chaos engineering also helps meet leading DevOps objectives related to agility in software development by avoiding roadblocks along the way.

As more planned failures are injected into the software system, weaknesses can be identified quickly, before they lead to major downtime. With added automation and a few additions to its skill set, an SRE team can achieve these chaos engineering outcomes.

What Is The Importance Of Chaos Engineering?

Several of the conventional paradigms of system reliability testing keep reactive processes at the forefront of their resiliency-building strategies. These tend to focus more on after-the-fact actions such as incident management and restoring a system after downtime.

The chaos engineering approach aims to address a system issue before it even arises. It involves building failsafes into the system's processes at various points to guard against certain predetermined system errors. The main objectives of chaos engineering include the following:

Exploring and preventing possibilities of technical debt.
Building trust among the teams collaborating on a software system, and in the system itself.
Using plugins and integrations to cover as many points or triggers of system failure as possible.
Building system reliability and resilience and promoting more experiment-based learning for SRE teams.

A Brief History Of Chaos Engineering

The need for chaos engineering tools and teams arose when technology conglomerates running large-scale distributed internet systems had to work out failsafe strategies for potential failures. Some of the earliest adopters of chaos engineering tools were Netflix, Facebook, and Alphabet, which worked with complex system architectures and processed enormous volumes of data.

Chaos Monkey, pioneered by the Netflix Engineering Tools team in 2010, was the first initiative in chaos engineering. Its purpose was to prevent interruptions to the Netflix streaming experience, so that the loss of an instance in the cloud infrastructure provided by Amazon Web Services (AWS) would not disrupt the end-user experience.

Once Netflix published the code for Chaos Monkey on GitHub in 2012, several smaller chaos engineering projects sprang up throughout the tech industry. In 2014, Netflix introduced the dedicated role of Chaos Engineer, which took off and became a mainstay at Big Tech corporations such as Google and Facebook.

2018 saw the first-ever open conference, Chaos Conf, and its attendance grew tenfold over the next two years. In 2020, Amazon added chaos engineering to the reliability pillar of its Well-Architected Framework (WAF). All of this culminated in the world's first State Of Chaos Engineering report in 2021, which documented the growth of this practice within the tech industry.

What Are The Principles Of Chaos Engineering?

Chaos engineering delivers its best results only when certain principles are followed. Over the last decade, these principles have been applied and refined so that today's chaos engineering teams can maximize the desired outcomes. Through continuous experimentation, the following chaos engineering principles were arrived at:

1) Benchmark Steady State

Instead of pulling apart a system's internal characteristics, the focus needs to shift to its measurable output. Measurements of that output over a short period serve as a proxy for the system's steady state. Overall throughput, error rates, and latency percentiles are some of the system-wide metrics that can be used to measure and benchmark steady-state behavior.
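
As a rough illustration of benchmarking a steady state, the sketch below condenses one observation window into throughput, error rate, and latency percentiles. The Request record, the nearest-rank percentile method, and the sample data are assumptions made for this example, not part of any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # observed response time
    ok: bool           # True if the request succeeded


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def steady_state(requests, window_seconds):
    """Summarize one observation window into benchmark metrics."""
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if not r.ok)
    return {
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
    }


# Hypothetical sample: 10 requests observed over a 5-second window,
# with latencies of 10 ms to 100 ms and one failed request.
sample = [Request(latency_ms=10.0 * (i + 1), ok=(i != 9)) for i in range(10)]
baseline = steady_state(sample, window_seconds=5)
```

A baseline captured this way gives later experiments something concrete to compare against.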

2) Event-Based Variables

Chaos variables should ideally represent events taken from real-life scenarios. These events need to be prioritized according to either their potential impact or their expected frequency of occurrence. Failure-related events, such as the software system's servers failing, as well as non-failure events, such as an increase in online traffic or a scaling event, can function as variables for chaos engineering.
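
One possible way to order such events is sketched below. The 1 to 5 impact scale, the yearly frequency estimates, and the choice to rank by the product of the two are all assumptions made for illustration; a team could just as well sort by impact or by frequency alone.

```python
from dataclasses import dataclass


@dataclass
class ChaosEvent:
    name: str
    impact: int              # assumed scale: 1 (minor) to 5 (severe)
    yearly_frequency: float  # assumed estimate of occurrences per year


def prioritize(events):
    """Order events by impact x frequency, highest score first."""
    return sorted(
        events,
        key=lambda e: e.impact * e.yearly_frequency,
        reverse=True,
    )


# Hypothetical estimates for the kinds of events named in the text.
events = [
    ChaosEvent("server failure", impact=4, yearly_frequency=6),
    ChaosEvent("traffic spike", impact=3, yearly_frequency=12),
    ChaosEvent("scaling event", impact=2, yearly_frequency=50),
]
ranked = [e.name for e in prioritize(events)]
```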

3) Production-Level Automated Experimentation

Carrying out experiments manually is labor-intensive and unsustainable in the long run. Systems respond in a variety of ways depending on their environment and traffic patterns, and behavior related to resource utilization can change at any time, so sampling real traffic is the only reliable way to capture the request flow properly. Chaos engineering therefore builds automation into the system to speed up both running the experiments and analyzing their results.
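
A minimal sketch of what one automated experiment could look like follows: measure a baseline, inject the fault, measure again, always roll back, and report whether the steady state held. The function names and the in-memory stand-in for a real system are hypothetical.

```python
def run_experiment(measure, inject, rollback, tolerance):
    """Run one automated chaos experiment.

    measure():  returns the current value of a steady-state metric
    inject():   introduces the fault
    rollback(): removes the fault (always called, even on errors)
    tolerance:  largest allowed absolute deviation from the baseline
    """
    baseline = measure()
    inject()
    try:
        observed = measure()
    finally:
        rollback()
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": abs(observed - baseline) <= tolerance,
    }


# Hypothetical stand-in for a real system: the error rate jumps
# while the fault is active.
state = {"fault_active": False}


def measure_error_rate():
    return 0.25 if state["fault_active"] else 0.0


def inject_fault():
    state["fault_active"] = True


def remove_fault():
    state["fault_active"] = False


result = run_experiment(
    measure_error_rate, inject_fault, remove_fault, tolerance=0.125
)
```

Because the whole cycle is a single function call, it can be scheduled or wired into a pipeline instead of being run by hand.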

4) Controlled Blast Radius

If an experiment exposes a potential failure in your software system, you can stop there and address it before going further. If not, you need to increase the blast radius, i.e., the scope of the experimentation within the system. Experimenting in production may put customers through needless pain. It is the responsibility of the chaos engineer to make sure that the consequences of experiments are minimized and contained, even though some short-term adverse effects must be allowed for.
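
The idea of widening the blast radius step by step, and stopping as soon as a weakness shows up, can be sketched as follows. The fractions, the abort threshold, and the fake system are hypothetical values chosen for this example.

```python
def expand_blast_radius(run_at, steps, abort_threshold):
    """Widen the experiment scope one step at a time.

    run_at(fraction): runs the experiment against that fraction of
                      hosts or traffic and returns the error rate
    steps:            increasing fractions to try
    abort_threshold:  error rate above which the experiment stops
    Returns the largest fraction that stayed within the threshold,
    plus the history of (fraction, error_rate) pairs.
    """
    history = []
    safe_fraction = 0.0
    for fraction in steps:
        error_rate = run_at(fraction)
        history.append((fraction, error_rate))
        if error_rate > abort_threshold:
            break  # weakness found: stop here and contain the impact
        safe_fraction = fraction
    return safe_fraction, history


# Hypothetical system that copes until more than 10% of hosts are hit.
def fake_run(fraction):
    return 0.01 if fraction <= 0.10 else 0.20


safe, history = expand_blast_radius(
    fake_run, [0.01, 0.05, 0.25, 1.0], abort_threshold=0.05
)
```

In this run the experiment never reaches the full fleet, because the 25% step already crosses the abort threshold.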

Top 5 Chaos Engineering Tools

As more Big Tech companies have adopted the chaos engineering paradigm, this emerging SRE strategy has matured considerably over the last decade. Several best practices have taken shape around it, leading to the introduction of efficient chaos engineering tools. The following are five of the leading tools on the market today:

1) Gremlin

Gremlin was the world's first managed enterprise chaos engineering service. It is available as a Software-as-a-Service (SaaS) solution and can check system resiliency through at least three attack modes. The chaos engineering team chooses which attack to run based on the results it expects. For comprehensive assessments of software system infrastructures, Gremlin also provides the option to perform tests in conjunction with other tools.

Pros

Run simultaneous tests for attacks
Customizable UI for various attacks
CLI, API, and UI automation are available

Cons

The full version requires licensing fee payment
Internal code is not customizable
Reporting capabilities are non-existent

2) Chaos Toolkit

The Chaos Toolkit is an intuitive Command-Line Interface (CLI) based tool that helps chaos engineering teams collaborate on and run chaos experiments. Because it declares and stores experiments as JSON or YAML files, they can be orchestrated, versioned, and analyzed like any other file-based, CLI-driven workflow. Reporting and system failure scheduling are highly customizable features within the Toolkit's procedures.
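
To give a feel for such a declarative experiment, the sketch below builds one as a Python dictionary and serializes it to JSON. The overall layout (a steady-state hypothesis with probes, a method with actions, and rollbacks) follows the Chaos Toolkit experiment format as commonly documented, but the service URL, process name, and titles are hypothetical placeholders, and the exact fields should be checked against the Toolkit's own reference.

```python
import json

# All names, URLs, and commands below are hypothetical placeholders.
experiment = {
    "version": "1.0.0",
    "title": "Service stays available when one worker is killed",
    "description": "Illustrative only; adapt probes and actions.",
    "steady-state-hypothesis": {
        "title": "Health endpoint answers 200",
        "probes": [
            {
                "type": "probe",
                "name": "health-check",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health",
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "kill-one-worker",
            "provider": {
                "type": "process",
                "path": "pkill",
                "arguments": "-f example-worker",
            },
        }
    ],
    "rollbacks": [],
}

# Serialize the declaration; it could then be written to a file.
experiment_json = json.dumps(experiment, indent=2)
```

Saved to a file such as experiment.json, a declaration like this would typically be executed with `chaos run experiment.json`.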

Pros

It is extensible through its open-source API
Simple embedding in any CI/CD workflow for automation

Cons

Requires working Python installation to run
Has only been tested against CPython
Needs several extensions for well-rounded interfacing

3) Chaos Mesh

Chaos Mesh is a cloud-native open-source tool that offers a wide range of fault simulations. It can surface system aberrations throughout the development, testing, and production stages. DevOps workflows can easily embed Chaos Mesh to run chaos experiments within Kubernetes environments. Network latency, resource utilization, and the behavior of distributed systems are some of the aspects that can be tested and improved with the help of this tool.

Pros

Flat learning curve
UI supports multiple configurations
Pause/resume experiments as required

Cons

No strict scheduling of attacks
Increased security risks

4) Chaos Monkey

Chaos Monkey was introduced as the first-ever chaos engineering solution by the Netflix Engineering Tools team in 2010 and triggered the chaos engineering revolution. Over the last decade, it has gone through a wide array of upgrades and has matured into a tool that prepares the software system for all kinds of continuous, unpredictable attacks. You can easily schedule attacks and monitor the entire procedure at a granular level.

Pros

Easy scheduling and monitoring
No licensing costs as it is fully open-source
Trackable history

Cons

Requires understanding and writing custom Go code
Only one type of experimentation at a time and a limited blast radius

5) Litmus

Litmus is an open-source fault injection tool for cloud-native environments and infrastructure. It enables chaos engineers to identify system faults by creating and analyzing chaos within Kubernetes, so that systems working with CI/CD pipelines can easily conduct failure experimentation. All the chaos experimentation, followed by remediation, is carried out before the software system goes into full-scale production. Experiments can be run across clusters while still pinpointing individual bugs in the system's internal code and limitations in its UI.

Pros

Regular system health checks
Error detection and resiliency checks are automated
Centralized experimentation resource repository

Cons

Steep learning curve
Tracking and managing permissions is not hassle-free
Complications related to setting up multiple accounts within an enterprise

What Are The Benefits Of Chaos Testing?

Over the last decade, chaos engineering, a term used interchangeably with chaos testing, has strengthened the resiliency of large-scale cluster-based software systems. Tech giants such as Alphabet, Netflix, and Facebook have been able to enhance their streaming capabilities, improve customer experience, and reduce the time taken to remediate system faults. Some of the most frequently observed advantages include the following:

System data before, during, and after a breakdown tends to be monitored more closely. This aids recovery from the actual failure and also provides information for further investigation and resiliency building.
Simply adding a check in the system's code can greatly reduce downstream failures and the severity of a crash. New information about the production system becomes available quickly once the system has been restarted and analyzed.
Confidence in the Mean Time To Recovery (MTTR) increases, because the system's end-to-end capabilities improve markedly when fixes are actually made after each fault.
Chaos engineers can get more inventive and produce targeted yet random failures that affect only a specific area of the production environment.
Faults and triggers of system downtime can be identified, analyzed, and reported without lowering system performance, cutting users off from a portion of the network, or removing a microservice.
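
For reference, the Mean Time To Recovery mentioned above is simply the average time between a failure being detected and service being restored. The sketch below computes it from a small incident log; the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (restored - detected) across incidents."""
    total = sum(
        (restored - detected for detected, restored in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incident log: (failure detected, service restored).
incidents = [
    (datetime(2022, 6, 1, 10, 0), datetime(2022, 6, 1, 10, 30)),
    (datetime(2022, 6, 8, 14, 0), datetime(2022, 6, 8, 14, 20)),
    (datetime(2022, 6, 15, 9, 0), datetime(2022, 6, 15, 9, 10)),
]
mttr = mean_time_to_recovery(incidents)
```

Tracking this figure across successive experiments shows whether recovery is actually getting faster.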

Chaos Testing Best Practices

The best approaches to chaos testing are built on continuous testing and vary depending on whether they are applied at the infrastructure, network, or application layer, as follows:

Infrastructure Layer: In this layer, chaos testing works by anticipating production faults. This is done by duplicating instances of varying availability zones, regions, etc., without making abort or roll back functions available. The production or production-like environments are the primary areas of implementation.

Network Layer: The network settings are configured to simulate conditions under which deliberate faults can be created in the connections. Before taking up the network layer, the application itself should be free from flaws but leave scope for randomized chaos. Noise across the network layer, as well as interrupted connectivity, can be simulated as chaos.
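
At the level of application code, network noise and interrupted connectivity can be imitated with a thin wrapper around a call, as in the sketch below. This is an in-process illustration only, not how dedicated tools manipulate real network settings; the class, the retry helper, and the backend function are hypothetical.

```python
import random
import time


class FlakyNetwork:
    """Wraps a callable and injects latency and dropped connections."""

    def __init__(self, call, drop_probability, max_delay_s, seed=None):
        self.call = call
        self.drop_probability = drop_probability
        self.max_delay_s = max_delay_s
        self.rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        # Simulated noise: wait a random amount of time first.
        time.sleep(self.rng.uniform(0, self.max_delay_s))
        # Simulated interruption: sometimes fail instead of answering.
        if self.rng.random() < self.drop_probability:
            raise ConnectionError("injected: connection dropped")
        return self.call(*args, **kwargs)


def with_retries(call, attempts):
    """Client-side resilience under test: retry dropped connections."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise


# Hypothetical backend call; no real network is involved.
def fetch_profile():
    return {"user": "demo"}


always_up = FlakyNetwork(fetch_profile, drop_probability=0.0, max_delay_s=0.01)
always_down = FlakyNetwork(fetch_profile, drop_probability=1.0, max_delay_s=0.01)
```

Calling `with_retries(always_up, attempts=3)` returns the profile, while the same call against `always_down` ends in a `ConnectionError` once the attempts are used up.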

Application Layer: Early on in the development stages of the software system, chaos engineering principles are implemented systematically. Software Development Engineers in Test (SDET) and other developers are the primary facilitators of chaos tests in the application layer. Ops teams and product owners can also be consulted to direct the flow of the fault injection and chaos engineering initiatives in the right direction.

Conclusion

If chaos engineering is built into the enterprise software development workflow, higher levels of system uptime can be achieved. The end-user experience becomes streamlined and uninterrupted. Chaos engineering strategies enable chaos testers and developers to have fixes already in place for unexpected, rapid fluctuations in network bandwidth and for API timeouts, and to remediate system faults quickly.

Streaming platforms can make sure that downloads and streaming resume from points where they are paused without delays. Implementing such continuous testing strategies and advanced software development methodologies requires frequent consultations with the right experts. Daffodil's Software Development Services are your best bet for implementing the latest approaches for ensuring high-quality software delivery.
