10/28/2021 | Press release | Distributed by Public on 10/28/2021 11:46
Testing for mishaps you can predict is essential. But with the complexity that comes with digital transformation and cloud-native architecture, teams need a way to make sure applications can withstand the "chaos" of production. Chaos engineering answers this need so organizations can deliver robust, resilient cloud-native applications that can stand up under any conditions.
Chaos engineering is a method of testing distributed software that deliberately introduces failure and faulty scenarios to verify its resilience in the face of random disruptions. These disruptions can cause applications to respond in unpredictable ways and to break under pressure. Chaos engineers ask why.
Practitioners subject software to a controlled, simulated crisis to test for unstable behavior. The crisis could be a technical, natural, or malicious event: for example, an earthquake affecting data center availability, or a cyberattack infecting applications and websites. As software performance degrades or fails, the chaos engineers' findings enable developers to add resiliency into the code, so the application remains intact in an emergency.
As chaos engineers grow confident in their testing, they change more variables and broaden the scope of the disaster. The many disaster scenarios and outcomes allow chaos engineers to better model what happens to applications and microservices, which gives them increasing intelligence to share with developers to perfect software and cloud-native infrastructure.
Netflix pioneered chaos engineering out of necessity. In 2009, the purveyor of online videos migrated to AWS cloud infrastructure to deliver its entertainment to a growing audience. But the cloud brought new complexities, such as increasing connections and dependencies. It created more uncertainty than the load balancing issues the entertainment firm saw in its data centers. If any touchpoint in the cloud failed, the quality of the viewers' experience could degrade. So, the organization sought to reduce complexity and raise production quality.
In 2010, Netflix introduced a technology to switch production software instances off at random - like setting a monkey loose in a server room - to test how the cloud handled its services. Thus, the tool Chaos Monkey was born.
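The idea of switching instances off at random can be illustrated with a small simulation. This is a minimal sketch, not Netflix's actual Chaos Monkey implementation: the function name `chaos_round`, the `kill_probability` and `min_healthy` parameters, and the instance names are all hypothetical, and nothing here touches real infrastructure.

```python
import random
from typing import List, Set, Tuple


def chaos_round(
    instances: Set[str],
    kill_probability: float,
    min_healthy: int,
    rng: random.Random,
) -> Tuple[List[str], Set[str]]:
    """Simulate one round of random instance termination.

    Returns (victims, survivors). Hypothetical helper for illustration only.
    """
    # Sort first so a seeded rng produces reproducible picks.
    candidates = sorted(instances)
    # Each instance is picked independently with the given probability.
    picked = [name for name in candidates if rng.random() < kill_probability]
    # Safety guard: never terminate so many that fewer than min_healthy remain.
    max_kills = max(0, len(instances) - min_healthy)
    victims = picked[:max_kills]
    survivors = instances - set(victims)
    return victims, survivors


# Invented fleet of six instances; at least four must stay up.
fleet = {f"web-{i}" for i in range(6)}
victims, survivors = chaos_round(
    fleet, kill_probability=0.5, min_healthy=4, rng=random.Random(42)
)
print("terminated:", victims)
print("still running:", sorted(survivors))
```

The `min_healthy` guard is an assumption added for safety in this sketch: it caps how much damage a single round can do, which mirrors the controlled nature of the practice described in this article.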
Chaos engineering matured at organizations such as Netflix, becoming more targeted and knowledge-based, and gave rise to technologies such as Gremlin (2016). The science has spawned specialized chaos engineers who dedicate themselves to disrupting cloud software, and the on-prem systems it interacts with, to make them resilient. Now, chaos engineering is an established profession, stirring up managed trouble to stabilize cloud software.
Chaos engineering starts with understanding the software's expected behavior.
To limit damage to production environments, chaos engineers start in a non-production environment, then slowly extend to production in a controlled way. Once established, chaos engineering becomes an effective way to fine-tune service-level indicators and objectives, improve alerting, and build more efficient dashboards, so you know you are collecting all the data you need to accurately observe and analyze your environment.
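The staged approach described here can be sketched as a loop that checks expected behavior at each step and stops widening the experiment as soon as that expectation is violated. This is a minimal illustration under stated assumptions: the `Stage` type, the `run_experiment` function, the error-rate threshold, and the measurement stub are all hypothetical, and the numbers are invented. A real setup would replace the stub with a query to a monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    environment: str     # e.g. "staging" or "production"
    blast_radius: float  # fraction of that environment exposed to the fault


def run_experiment(
    stages: List[Stage],
    measure_error_rate: Callable[[Stage], float],
    max_error_rate: float,
) -> List[Tuple[Stage, float, bool]]:
    """Run stages in order; abort at the first one that breaks the hypothesis.

    The hypothesis here is simply "error rate stays at or below max_error_rate".
    """
    results = []
    for stage in stages:
        observed = measure_error_rate(stage)
        ok = observed <= max_error_rate
        results.append((stage, observed, ok))
        if not ok:
            break  # do not widen the blast radius any further
    return results


def fake_measure(stage: Stage) -> float:
    # Stand-in for a real monitoring query; these values are invented.
    if stage.environment == "staging":
        return 0.002
    return 0.004 + 0.02 * stage.blast_radius


# Start outside production, then expose a growing share of production.
plan = [
    Stage("staging", 1.0),
    Stage("production", 0.05),
    Stage("production", 0.25),
    Stage("production", 1.0),
]

results = run_experiment(plan, fake_measure, max_error_rate=0.01)
for stage, observed, ok in results:
    status = "ok" if ok else "ABORT"
    print(f"{stage.environment} @ {stage.blast_radius:.0%}: {observed:.4f} {status}")
```

Expressing the expected behavior as a single threshold keeps the sketch short. In practice a team would likely check several service-level indicators at each stage before deciding to continue.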
To learn more about how Dynatrace can help your team master chaos engineering experiments, join us for the on-demand performance clinic, Mastering Chaos Engineering Experiments with Gremlin and Dynatrace today.
Chaos engineering generally originates from small teams within DevOps, often involving applications running in both pre-production and production environments. Because it can touch many systems, chaos engineering can have broad implications, affecting groups and stakeholders across the organization.
A disruption spanning hardware, networks, and cloud infrastructure can require input and participation from network and infrastructure architects, risk experts, security teams, and even procurement officers. That's a good thing. The greater the scope of the test, the more useful chaos engineering becomes.
Although a small team generally owns and manages the chaos engineering effort, it's a practice that often requires input from, and provides benefits to, the village.
The insights you can gain by testing the limits of your applications deliver a lot of benefits for your development teams and your overall business. Here are just a few benefits of a healthy, well-managed chaos engineering practice.
The more resilient an organization's software is, the more consumers and business customers can enjoy its services without distraction or disappointment.
Although the benefits of chaos testing are clear, it is a practice that should be undertaken with deliberation. Here are the top concerns and challenges.
As with any scientific experiment, getting started with chaos engineering requires a little preparation, organization, and the ability to monitor and measure results.
Solutions like Gremlin provide crucial management tools to plan and execute chaos engineering experiments. They make experiments repeatable and scalable, so teams can rerun them later against the same or larger stacks.
Automatic and intelligent observability from Dynatrace delivers insights into the effects of chaos testing, so engineers can steer chaos experiments with care. To monitor the blast radius, Dynatrace observes the systems undergoing chaos experiments. With visibility across the full software stack, Dynatrace provides crucial contextual analysis to isolate the root cause of failures exposed by chaos testing.
Effective monitoring from Dynatrace offers an essential panoramic lens for the engineers driving chaos testing, helping them understand dependencies and predict how outages will affect the system at large. Should the chaos reach further than intended, insights from Dynatrace help teams quickly remediate any actual harm to the application's functionality.
Organizations can achieve application resiliency in any stage of digital transformation, and chaos engineering is a great tool. However, before playing with fire, it's critical to have the right measures in place to predict and cope with the multitude of failure scenarios this approach can bring.
To hear more about chaos engineering in action, listen to the PurePerformance podcast, Chaos Engineering Stories that could have prevented a global pandemic with Ana Medina, Sr. Chaos Engineer at Gremlin.