Chaos Engineering

Use Gremlin, Chaos Monkey, and monitoring tools (such as Datadog) to measure and improve MTTD and MTTR

Overview

“40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.” [Source: Gartner]

NOTE: The content here is my personal opinion and is not intended to represent any employer (past or present). “PROTIP:” here highlights information I haven’t seen elsewhere on the internet because it is hard-won, little-known but significant facts based on my personal research and experience.

What is Chaos Engineering?

Vendor Gremlin’s definition:

“Chaos Engineering” consists of thoughtful, controlled experiments designed to reveal the weaknesses of systems, which results in reduction of downtime and quicker response to anomalies.

The definition of “Chaos Engineering” on Wikipedia:

Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.

https://github.com/dastergon/awesome-chaos-engineering

Making bad things happen

Chaos Engineering is an investment in moving from a reactive to proactive approach to reliability engineering.

Instead of waiting for an outage to “see what happens” (at the worst possible time), it involves conducting experiments to expose systemic weaknesses before they become aberrant behaviors in production.

Security Chaos Engineering

VIDEO compares traditional Audit and Test (aka Red/Blue/Purple Team) against modern Chaos Engineering:

Factor	Test/Audit	Chaos Engineering
By:	external contractors	in-house staff
Scope (interfaces):	external-facing	internal and external facing
Frequency:	periodically (annually)	continuously
Tools:	manual	automated (in CI/CD, perf test)
Goals:	judgement	iterative improvement of resilience
Expected Outcome:	confirmation of posture	high definition insights about processes
Objective:	not for learning	create learning opportunities
Beneficiaries:	Security & Ops	company-wide Incident Management
-	NOT cloud native	cloud native

Azure enablement show: “Under Chaos Engineering”

Chaos Engineering https://aka.ms/azenable/79/01
Understanding Chaos Engineering and Resilience https://aka.ms/azenable/79/02

Hypotheses of failure modes

Real world “chaos” in Virtual Machines (and how to inject failure):

Misconfiguration of network and server resources (in Terraform HCL, CloudFormation, etc.)
System time Change (by “Time Travel” utility)
CPU usage spike (sidecar program making complex calculations)
Memory RAM usage spike (sidecar program consuming memory)
Hard drive free space available (program consuming disk space)
Disk I/O contention (competing program)
DNS resolution failure (by operating system command)
Transaction latency (by proxy holding requests)
Network bandwidth (competing program hogs bandwidth)
Network connections severed (by operating system command)
Network TCP packet Loss
Specific app process killed (by operating system command)
Server shutdown (by operating system command)
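
Two of the resource faults listed above (CPU spike and memory spike) can be sketched as a small sidecar program. This is only an illustration under assumptions: the function names burn_cpu and hold_memory are hypothetical, and it keeps a single core busy for a bounded time so the "blast radius" stays small.

```python
import time

def burn_cpu(seconds: float) -> int:
    """Busy-loop on one core for roughly `seconds`; returns the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pointless arithmetic keeps the core busy
    return iterations

def hold_memory(megabytes: int, seconds: float) -> int:
    """Allocate `megabytes` of RAM, hold it for `seconds`, then release it."""
    block = bytearray(megabytes * 1024 * 1024)
    time.sleep(seconds)
    size = len(block)
    del block  # drop the reference so the memory can be reclaimed
    return size

if __name__ == "__main__":
    burn_cpu(0.5)         # half a second of CPU pressure
    hold_memory(50, 0.5)  # 50 MB held for half a second
```

Both functions stop on their own after the requested duration, which is the simplest form of an abort condition.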

Potential failures possible (based on principlesofchaos.org):

Single point of failure (SPOF) crashes with no fallback
Improper or ineffective fallback settings when a service is unavailable (such as the system not being in a safe state after failure)
Retry storms from improperly tuned timeouts
Cascading outages when a downstream dependency receives too much traffic
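
A retry storm is one hypothesis that is easy to reason about in code. The sketch below shows capped exponential backoff with full jitter, a common mitigation that a chaos experiment could be used to validate. The function name backoff_delays and its default values are assumptions for illustration, not settings from any particular client library.

```python
import random

def backoff_delays(base_s=0.1, cap_s=5.0, attempts=5, rng=random.random):
    """Return the wait (in seconds) before each retry.

    Each retry waits a random fraction of an exponentially growing ceiling,
    so clients that failed together do not all retry at the same instant.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * 2 ** attempt)  # 0.1, 0.2, 0.4, ... capped
        delays.append(rng() * ceiling)               # "full jitter"
    return delays

if __name__ == "__main__":
    print(backoff_delays())
```

Passing a fixed rng (such as lambda: 1.0) makes the schedule deterministic, which helps when asserting on it in a test.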

See Failure Modes (below).

Metrics

The speed to detect and respond to anomalies is a key part of the “Operational Excellence” pillar of Well-Architected cloud “best practice” implementation and evaluation frameworks by Amazon, Microsoft, and Google.

A sample Acceptance Criteria statement for work on Chaos Engineering is “confidence in our production deployments” despite the complexity that they represent.

Specific metrics to consider:

Availability (unplanned downtime per year/month/week/day/hour). Components of this include:
Transaction throughput per hour/day/week/month/quarter/year
Latency (response time to user requests) percentiles
MTTD (Mean Time to Detect) - How long did it take for someone to realize there is a problem? The starting point is an event that may not be specifically logged, but inferred from other observations.
MTTM (Mean Time to reMediate) - How long did it take for the interruption (vulnerability) to be corrected in production?
MTTI (Mean Total Time of Impact) to operations.
MTBF (Mean Time Between Failures) - How long does the system run between one failure and the next?
RTO (Recovery Time Objective), the target for MTTR (Mean Time to Repair/Recover) - How long for interruptions to be repaired?
RPO (Recovery Point Objective) - How far back in time data can be recovered. If recovery depends on backups, the RPO is the interval between backups, which can be a day.
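
The metrics above can be computed from a simple list of incident timestamps. The sketch below is illustrative only: the Incident record and its field names are hypothetical, MTTR is assumed to run from the start of the fault to restoration, and MTBF is assumed to be the uptime between the end of one incident and the start of the next. Teams define these differently, so adjust to your own definitions.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Incident:
    started: float   # when the fault began (seconds since some epoch)
    detected: float  # when someone (or a monitor) noticed
    resolved: float  # when service was restored

def mttd(incidents):
    """Mean Time to Detect: fault start to detection."""
    return mean(i.detected - i.started for i in incidents)

def mttr(incidents):
    """Mean Time to Recover: fault start to restoration (an assumption)."""
    return mean(i.resolved - i.started for i in incidents)

def mtbf(incidents):
    """Mean uptime between the end of one incident and the start of the next."""
    ordered = sorted(incidents, key=lambda i: i.started)
    return mean(b.started - a.resolved for a, b in zip(ordered, ordered[1:]))

def availability(incidents, window_s):
    """Fraction of the observation window that was not downtime."""
    downtime = sum(i.resolved - i.started for i in incidents)
    return 1 - downtime / window_s

if __name__ == "__main__":
    history = [Incident(0, 60, 300), Incident(1000, 1030, 1100)]
    print(mttd(history), mttr(history), mtbf(history), availability(history, 4000))
```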

Monitoring Vendors

Vendors offering products and SaaS services:

Datadog
Dynatrace
New Relic
Elastic
Splunk
etc.

PROTIP: Summarized metric reports give executives of an enterprise a view of the resiliency posture of its systems.

Preparations and efforts

Steps in a Chaos Engineering effort:

Define current organization structure (teams) and participant contacts
Define organizational goals and objectives (such as “reduce unplanned downtime by 20%”)
Define metrics and SLAs to measure results (Uptime, Availability, MTTD, and MTTR)
Experiment designed (using Gremlin) with hypothesis of failure modes.
Scenarios for how to set up the environment and inject failure (regular reliability tests)
Observability tools (Datadog, New Relic, etc.) installed after user training
Metrics to measure,
Periodic health check
Alert levels (using PagerDuty)
Abort conditions,
Define runbooks to define/standardize response to chaos
Sample reports to be generated.
Pitch executives to get buy-in (this involves an “elevator pitch”, “business case”, and “proof of concept”)
Executive sponsor. If your leadership’s attitude is to do the minimal and just recover when needed, this is not for you
Budget and objectives approved by management
Chaos Engineering team leads (champions) commissioned
Periodic (weekly, monthly, quarterly, annually) reporting to management defined
Teams assembled
Accounts with permissions provisioned with budget
Training conducted and learning verified (certifications)
Communication channels (Slack, email, etc.) established and tested
Team trained on how to use start/abort scenarios that inject failure (Gremlin)
Systems can be created repeatedly (using IaC) for running in stable “steady state”
Artificial load generation
Install monitoring systems and procedures (currently in place) to produce “as is” baseline metrics (see below)
Analyze baseline metrics with visual analytics to identify and demonstrate “weaknesses” as “opportunities”
Define plan of action (design experiments)
Implement plan of action (conduct experiments on Game Days)
Analyze evolving metrics to determine if the plan of action is working, and adjust as necessary
Define lessons learned and updated best practices, scenarios, tools
Draft reports to management
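
The “abort conditions” step above can be sketched as a small guard loop around an experiment. This is a minimal sketch under assumptions: run_with_abort and its parameters are hypothetical names, and the inject, rollback, and read_error_rate callables stand in for whatever your chaos tool and monitoring system actually provide.

```python
import time

def run_with_abort(inject, rollback, read_error_rate, max_error_rate,
                   checks, interval_s=1.0, sleep=time.sleep):
    """Inject a fault, then poll a health metric and stop early if it is breached.

    Returns "aborted" if the error rate exceeded max_error_rate during the run,
    otherwise "completed". The rollback always runs, even on an exception.
    """
    inject()
    try:
        for _ in range(checks):
            if read_error_rate() > max_error_rate:
                return "aborted"
            sleep(interval_s)
        return "completed"
    finally:
        rollback()  # always restore the system to its steady state

if __name__ == "__main__":
    outcome = run_with_abort(
        inject=lambda: print("fault injected"),
        rollback=lambda: print("fault removed"),
        read_error_rate=lambda: 0.0,  # placeholder for a real metric query
        max_error_rate=0.05,
        checks=3,
        interval_s=0.1,
    )
    print(outcome)
```

Putting the rollback in a finally clause means the fault is removed whether the run completes, aborts, or crashes.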

Experiment Design

Chaos engineering experiments follow an approach:

Define steady state as some measurable output of a system that indicates normal behavior.
Hypothesize that this steady state will continue in both the control group and the experimental group. Ask how the organization and systems will respond to certain faults.
Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc. Set up Observability tools to measure the impact of the variables on the steady state of the system.
Try to disprove the hypothesis by looking for a difference in steady-state between the control group and the experimental group.
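
The comparison in the last step can be sketched as a simple check on one steady-state metric. This is an illustration only: the function name hypothesis_holds, the 5% tolerance, and the choice of “successful requests per second” as the metric are all assumptions, and a real analysis would likely use a proper statistical test.

```python
from statistics import mean

def hypothesis_holds(control, experiment, tolerance=0.05):
    """True if the experimental group's mean stays within `tolerance`
    (as a fraction of the control mean) of the control group's mean."""
    baseline = mean(control)
    return abs(mean(experiment) - baseline) <= tolerance * baseline

if __name__ == "__main__":
    control = [100, 102, 98]    # successful requests/sec without the fault
    experiment = [80, 85, 75]   # the same metric while the fault is injected
    print(hypothesis_holds(control, experiment))
```

A False result disproves the hypothesis, which is the useful outcome: it points to a weakness worth fixing.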

Chaos Automation Vendors

Sure, “perturbations” can be injected manually on a CLI, such as a server shut down command, to see what happens.

Chaos engineering utilities (systems) enable more experiments to be conducted quicker, for higher coverage, with better repeatability, at scale (running hundreds or thousands of servers), providing daily, weekly, monthly, and annual reports.

This article draws from several vendors.

The timeline at the top of this page depicts vendors who offer products and services to automate chaos engineering:

“Chaos Monkey” from Netflix
Gremlin (freemium)
CNCF Litmus with services by ChaosNative
AWS Fault Injection Simulator
others

“Postmortem Dashboards” that display timelines and metrics are offered by these vendors to help teams learn from failures:

Jira
FireHydrant
Blameless

Chaos Monkey from Netflix

Chaos Monkey was created at Netflix in 2010 and later open-sourced at github.com/Netflix/chaosmonkey. The current version is written in Go and integrates with Spinnaker, the continuous delivery platform used at Netflix. READ: Gremlin’s review of it and Netflix’s 2011 Simian Army.

Gremlin (freemium)

https://www.gremlin.com/docs/application-layer/authentication-configuration/

chaosnative.com, a CNCF (open source) project based on Cloud-Native Chaos Engineering.
Gremlin, freemium product with a GUI and professional support. It supports a wide range of operating systems.

GECEC

Gremlin Enterprise Chaos Engineering Certification (GECEC) online course is rated at 1 h 30m over 6 modules and includes a quiz with no time limit to pass 80% of 30 questions, given 3 attempts.

GCCEP

Gremlin Certified Chaos Engineering Practitioner Exam (GCCEP) https://github.com/certificate-study-guide provides two attempts to answer 80% of 20 questions on https://gremlin.coassemble.com/

Use the email link to set up a forever-free individual account. $750/month

https://www.gremlin.com/gremlin-free-software/?ref=blog https://www.gremlin.com/get-started/?ref=nav
https://www.gremlin.com/community/ get a Slack invite.
https://app.gremlin.com/login

CNCF Litmus & ChaosNative

LitmusChaos was originally developed for use on Kubernetes.

VIDEO: Introduction to Litmus Chaos | Rawkode Live

VIDEO Karthik S. is the maintainer of Litmus Chaos.

Documentation is at https://litmusdocs-beta.netlify.app/docs/introduction/

Network Chaos

Toxiproxy is a tool from Shopify for network chaos engineering. It is a proxy server that simulates many kinds of network misbehavior.
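
To show the idea of a misbehaving proxy (this is not Toxiproxy’s own code or API), here is a minimal sketch of a TCP proxy that adds latency to client requests, using Python’s standard asyncio streams. The names start_latency_proxy and _relay are hypothetical, and it listens only on 127.0.0.1.

```python
import asyncio

async def _relay(reader, writer, delay_s):
    """Copy bytes from reader to writer, pausing delay_s before each chunk."""
    try:
        while data := await reader.read(4096):
            await asyncio.sleep(delay_s)
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def start_latency_proxy(upstream_host, upstream_port, delay_s, listen_port=0):
    """Listen on localhost and forward to the upstream, delaying client-to-upstream traffic."""
    async def handle(client_reader, client_writer):
        up_reader, up_writer = await asyncio.open_connection(upstream_host, upstream_port)
        await asyncio.gather(
            _relay(client_reader, up_writer, delay_s),  # requests are slowed
            _relay(up_reader, client_writer, 0.0),      # responses pass straight through
        )
    # listen_port=0 lets the operating system pick a free port
    return await asyncio.start_server(handle, "127.0.0.1", listen_port)
```

Pointing an application at the proxy’s port instead of the real dependency lets you observe how its timeouts and retries behave when that dependency slows down.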

Roles for “Game Day”

PROTIP: Hold a “Game Day” to replicate SEV and confirm fix is reliable:
- General (IMOC = Incident Manager On Call) who defines the schedule and decides on abort conditions.
- TLOC (Tech Lead On Call) stays focused on technical problem solving.
- Commander who implements and executes experiments.
- Scribe who records experiments and results.
- Observer who correlates results.

Failure Modes

Review previous RCA (Root Cause Analysis) aka Known Failure Modes to define attack scenarios.

NOTE: Gremlin’s unique value proposition is that it can turn incident reproduction results into automated scenarios Gremlin can run.
Target one of your services to impose failure modes:
- K8s Containers Orchestration
- AWS Cloud Compute
- Datadog monitoring
- Messaging
- Databases
- ALFI (Application-Level Failure Injection), such as on AWS RDS (VIDEO)
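
The idea behind application-level failure injection can be sketched with a plain Python decorator. This is not Gremlin’s ALFI library or its API. The names inject_faults and InjectedFault are hypothetical, and the sketch simply adds latency and random failures around a function call.

```python
import functools
import random
import time

class InjectedFault(RuntimeError):
    """Raised in place of the real call when a failure is injected."""

def inject_faults(latency_s=0.0, failure_rate=0.0, rng=random.random, sleep=time.sleep):
    """Wrap a function so each call is delayed and fails with the given probability."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)          # simulated slow dependency
            if rng() < failure_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)  # otherwise behave normally
        return wrapper
    return decorator

@inject_faults(latency_s=0.05, failure_rate=0.2)
def fetch_profile(user_id):
    """Stand-in for a call to a database or downstream service."""
    return {"id": user_id}
```

Because the fault lives inside the application, it can target one code path (for example, a single database call) rather than a whole host.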
NOTE: Gremlin provides several “scenarios” to impose “chaos”:
- Inbound HTTP Traffic
- Outbound HTTP Traffic
NOTE: If you are running on Azure and have failover to another availability zone or region (GZRS), Microsoft takes care of the failover process so you shouldn’t even notice it occurred.
Identify a Linux or Windows server where Gremlin can be installed:
- Ubuntu, Debian
- CentOS
- RHEL
- Docker image
- Helm
- Windows

Add Gremlin in the server build process. On Windows:

msiexec /package https://windows.gremlin.com/installer/latest/gremlin_installer.msi

Enable monitoring to measure latency, resource usage
- CPU usage
- Memory RAM usage
- Disk space usage
- Disk I/O
- Network packet loss (simulate bandwidth limitation)
PROTIP: Gremlin is able to target the number of cores.
Set alerts to be sent via email, Slack, SMS text, etc.
Set daily, weekly, monthly, and annual statistical reports to be sent to a distribution list.
Choose attack mode:

Resource:
- CPU usage
- Memory RAM usage
- Disk space usage
- Disk I/O

State:
- Kill Process
- Shutdown
- Change System time (Time Travel)

Network:
- Drop traffic (Blackhole)
- DNS
- Latency
- Packet Loss on network
Gremlin creates traffic on the network from a Redis in-memory database.
Enable monitoring and alerts. Specifically, analyze latency in transactions going through the network.

Example result: as Gremlin increases load, typically it sees levels such as:
1. At 50 ms, the system has enough memory to absorb higher loads without degradation. However, the
2. At 100 ms, requests begin to be queued, so response times reflect time in queue.
3. At 300 ms, requests cannot be processed and responses reflect the handling of failed transactions.
PROTIP: One purpose of this work is to validate monitoring configurations and the ability of monitors to identify those different levels, because different actions are appropriate at each level.
Adjust monitoring and alert levels based on Gremlin runs.
- Adjust thresholds for alerts
- Adjust frequency of measurement recording
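
The latency levels in the example result above can be turned into alert thresholds. The sketch below is an assumption-laden illustration: the level names (“healthy”, “queuing”, “failing”) and the use of a nearest-rank percentile are my own choices, and the 100 ms and 300 ms boundaries come from the example, not from any tool’s defaults.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile of latency samples (pct is an integer 1-100)."""
    ordered = sorted(samples_ms)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct% of n, in integers
    return ordered[max(0, rank - 1)]

def classify_latency(latency_ms, queuing_ms=100, failing_ms=300):
    """Map a latency figure onto the three levels described in the example."""
    if latency_ms >= failing_ms:
        return "failing"   # requests cannot be processed
    if latency_ms >= queuing_ms:
        return "queuing"   # requests wait in a queue
    return "healthy"       # load is absorbed without degradation

if __name__ == "__main__":
    samples = [40, 45, 50, 55, 120, 130, 310]
    p95 = percentile(samples, 95)
    print(p95, classify_latency(p95))
```

After each Gremlin run, the two threshold arguments are the values to revisit.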
Run Gremlin to ensure that on-call personnel respond appropriately.

PROTIP: Measure the actual (upgraded) MTTD & MTTR (Mean Time to Detect and Repair) - How long did it take for the interruption to be detected and then repaired?
Adjust report distribution lists over time automatically, if possible.

https://groups.google.com/g/chaos-community/c/84VOWoDQiIg

Azure Chaos Studio

https://www.youtube.com/watch?v=AQl_zx6NFfU

Azure Chaos Studio Preview is a managed service that uses chaos engineering to help you measure, understand, and improve your cloud application and service resilience - to handle and recover from disruptions.

Why? “Improve application resilience by introducing faults and simulating outages”

John Savill’s video on Azure Chaos Studio
https://azure.microsoft.com/en-us/products/chaos-studio
https://learn.microsoft.com/en-us/azure/chaos-studio/
https://learn.microsoft.com/en-us/azure/chaos-studio/chaos-studio-overview
https://azure.microsoft.com/en-us/pricing/calculator/?service=chaos-studio
https://azure.microsoft.com/en-us/pricing/details/chaos-studio/

References

“Safety differently” visionary (and airline pilot) Sidney Dekker VIDEO: “Drift into Failure: From Hunting Broken Components to Understanding Complex Systems” (publisher: Routledge. $45 on Amazon)

https://neelanjanmanna.medium.com/a-beginners-practical-guide-to-containerisation-and-chaos-engineering-with-litmuschaos-2-0-eeb2ba859189

https://neelanjanmanna.medium.com/a-beginners-practical-guide-to-containerisation-and-chaos-engineering-with-litmuschaos-2-0-5f4f3cf2a55d

https://theqalead.com/podcast/gremlin-in-the-machine-how-to-achieve-chaos-engineering-netflix-amazon/

https://medium.com/the-cloud-architect/what-is-aws-fault-injection-simulator-and-why-you-should-care-3fbe457ca227

https://www.harness.io/blog/chaos-engineering-with-jenkins

Wilson Mar