Wilson Mar

Use Gremlin, Chaos Monkey, and monitoring tools (such as Datadog) to measure and improve MTTD and MTTR



“40% of organizations will implement chaos engineering practices as part of DevOps initiatives by 2023, reducing unplanned downtime by 20%.” [Source: Gartner]

NOTE: Content here reflects my personal opinions, and is not intended to represent any employer (past or present). “PROTIP:” highlights information I haven’t seen elsewhere on the internet because it is hard-won, little-known, but significant, based on my personal research and experience.


The definition of “Chaos Engineering” on Wikipedia:

    Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.

Vendor Gremlin’s definition:

    Chaos Engineering consists of thoughtful, controlled experiments designed to reveal the weaknesses of systems, which results in reduced downtime and quicker response to anomalies.


The speed to detect and respond to anomalies is a key part of the “Operational Excellence” pillar of the Well-Architected cloud frameworks from both Amazon and Microsoft.

Hypotheses of failure modes

Real world “chaos” in Virtual Machines (and how to inject failure):

  • CPU usage spike (sidecar program making complex calculations)
  • Memory RAM usage spike (sidecar program consuming memory)
  • Hard drive free space exhausted (program consuming disk space)
  • Disk I/O contention (competing program performing disk I/O)
  • Transaction latency (by proxy holding requests)
  • Network bandwidth (competing program hogs bandwidth)
  • Network connections severed (by operating system command)
  • Network TCP packet loss
  • Specific app process killed (by operating system command)
  • System time change (by “Time Travel” utility)
  • Server shutdown (by operating system command)
  • DNS resolution failure (by blocking DNS traffic)
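As a hedged illustration (not any vendor’s implementation), the first failure mode above can be injected with a small sidecar script that pins CPU cores for a fixed duration:

```python
import multiprocessing as mp
import time

def burn_cpu(seconds):
    """Busy-loop to keep one core fully occupied for `seconds`."""
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        _ = 123456789 * 987654321  # meaningless arithmetic to burn cycles

def cpu_spike(cores, seconds):
    """Run `cores` worker processes at full CPU; return elapsed wall time."""
    start = time.monotonic()
    workers = [mp.Process(target=burn_cpu, args=(seconds,)) for _ in range(cores)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.monotonic() - start

if __name__ == "__main__":
    print(f"spike ran for {cpu_spike(cores=2, seconds=0.5):.2f}s")
```

Watching your dashboards while such a script runs is the manual version of what the tools below automate.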

Potential failures (based on principlesofchaos.org):

  • improper fallback settings when a service is unavailable (such as the system not being in a safe state after failure)
  • retry storms from improperly tuned timeouts
  • outages when a downstream dependency receives too much traffic
  • cascading failures when a single point of failure crashes
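Several of these weaknesses trace back to naive retry loops. A minimal sketch of the “full jitter” exponential backoff strategy that mitigates retry storms (function name and parameters are illustrative):

```python
import random

def backoff_delays(base, cap, attempts, seed=None):
    """Full-jitter exponential backoff: each retry waits a random amount
    between 0 and min(cap, base * 2**attempt), so clients de-synchronize
    instead of hammering a recovering service in lockstep."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]

# Example: 6 retries starting at 100 ms, capped at 5 s
delays = backoff_delays(base=0.1, cap=5.0, attempts=6, seed=1)
print([round(d, 3) for d in delays])
```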

Making it happen

Chaos Engineering is an investment in moving from a reactive to proactive approach to reliability engineering.

Instead of waiting for an outage to “see what happens”, it involves conducting experiments to expose systemic weaknesses before they become aberrant behaviors in production.

A sample Acceptance Criteria statement for Chaos Engineering work is confidence in our production deployments despite the complexity they represent:

  • RTO and RPO expected and architecture/processes to achieve them are defined and approved by leadership.

  • Proof that failure of key resources in each environment results in recovery within RTO and RPO timeframes.

  • Improved availability (reduced unplanned down time) and development velocity.
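The RTO/RPO criterion above can be checked mechanically after each recovery drill. A sketch using hypothetical drill timestamps:

```python
from datetime import datetime, timedelta

def meets_objectives(outage_start, service_restored, last_good_backup, rto, rpo):
    """Check one recovery drill against RTO (max downtime) and RPO (max data-loss window)."""
    actual_rto = service_restored - outage_start   # how long the service was down
    actual_rpo = outage_start - last_good_backup   # how much recent data could be lost
    return actual_rto <= rto and actual_rpo <= rpo

# Hypothetical drill: 45 minutes of downtime, last good backup 30 minutes before the outage
outage = datetime(2022, 3, 1, 9, 0)
restored = datetime(2022, 3, 1, 9, 45)
backup = datetime(2022, 3, 1, 8, 30)
print(meets_objectives(outage, restored, backup,
                       rto=timedelta(hours=1), rpo=timedelta(hours=1)))  # True
```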


Preparations for Chaos Engineering effort:

  • Pitch
  • Executive sponsor. If your leadership’s attitude is to do the minimal and just recover when needed, this is not for you.
  • Team assembled and trained
  • Cloud accounts provisioned
  • Systems are created (using IaC) and running in “steady state”
  • Monitoring systems and procedures are in place to produce metrics (see below)
  • Baseline metrics


  • Availability (unplanned downtime)
  • Transaction throughput per hour/day/week/month/quarter/year
  • Latency (response time to user requests) percentiles

  • MTTD (Mean Time to Detect) - How long did it take for someone to realize there is a problem? The starting point is an event that may not be specifically logged, but inferred from other observations.

  • MTTM (Mean Time to reMediate) - How long did it take for the interruption (vulnerability) to be corrected in production?

  • MTTR (Mean Time to Repair/Recover) - How long did it take for the interruption to be repaired?

  • MTTI (Mean Total Time of Impact) to operations.

  • MTBF (Mean Time Between Failures) - How much time elapses, on average, between one failure and the next?

  • RTO (Recovery Time Objective) - How quickly must service be restored after an outage?

  • RPO (Recovery Point Objective) - How much data loss (measured as a window of time) is acceptable?
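The time-based metrics above can be computed from an incident log. A sketch using hypothetical incidents (note that definitions vary; here MTTR is measured from the moment of occurrence):

```python
from datetime import datetime, timedelta

def mean_td(deltas):
    """Arithmetic mean of a list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

# Hypothetical incident log: (occurred, detected, repaired) per incident
incidents = [
    (datetime(2022, 1, 3, 10, 0), datetime(2022, 1, 3, 10, 5), datetime(2022, 1, 3, 10, 35)),
    (datetime(2022, 2, 7, 14, 0), datetime(2022, 2, 7, 14, 15), datetime(2022, 2, 7, 15, 0)),
]

mttd = mean_td([det - occ for occ, det, _ in incidents])   # mean time to detect
mttr = mean_td([rep - occ for occ, _, rep in incidents])   # mean time to repair, from occurrence
downtime = sum((rep - occ for occ, _, rep in incidents), timedelta())
availability = 1 - downtime / timedelta(days=60)           # over a 60-day reporting period
print(mttd, mttr, f"{availability:.4%}")
```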

Several vendors offer monitoring products and SaaS services:

  • Datadog
  • Dynatrace
  • New Relic
  • Elastic

Experiment Design

Chaos engineering experiments follow a four-step approach:

  1. Define ‘steady state’ as some measurable output of a system that indicates normal behavior.

  2. Hypothesize that this steady state will continue in both the control group and the experimental group.

  3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.

  4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
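The four steps above can be sketched as a comparison of a steady-state metric (here, error rate) between the control and experimental groups; the names and tolerance are illustrative:

```python
def steady_state_holds(control_outcomes, experiment_outcomes, tolerance=0.01):
    """The hypothesis survives only if the experimental group's error rate
    stays within `tolerance` of the control group's (1 = failed request)."""
    def rate(outcomes):
        return sum(outcomes) / len(outcomes)  # fraction of failed requests
    return abs(rate(experiment_outcomes) - rate(control_outcomes)) <= tolerance

control = [0] * 990 + [1] * 10      # 1% baseline error rate
experiment = [0] * 950 + [1] * 50   # 5% errors while a dependency is degraded
print(steady_state_holds(control, experiment))  # False: a weakness was revealed
```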

Automation Vendors

Sure, one can inject “perturbation” by manually shutting down a server to see what happens.

Chaos engineering utilities (systems) enable more experiments to be conducted more quickly, with higher coverage and better reporting, at scale (across hundreds or thousands of servers), providing daily, weekly, monthly, and annual reports about the resiliency posture of the systems.

This article draws from several vendors.

Vendors who offer products and services to automate chaos engineering are covered in the sections below.

Tools to maintain postmortems, displaying a Postmortem Dashboard with timelines and metrics:

  • Jira
  • FireHydrant
  • Blameless

Chaos Monkey from Netflix

Chaos Monkey was open-sourced in 2010 by Netflix at github.com/Netflix/chaosmonkey, written in Go and integrated for use within Spinnaker, the continuous delivery platform at Netflix. READ: Gremlin’s review of it and Netflix’s 2011 Simian Army.

Gremlin (freemium)


  • chaosnative.com, commercial support for LitmusChaos, a CNCF (open source) project based on Cloud-Native Chaos Engineering.

  • Gremlin, freemium product with a GUI and professional support. It supports a wide range of operating systems.

The Gremlin Certified Chaos Engineering Practitioner Exam (GCCEP, study guide at https://github.com/certificate-study-guide) provides two attempts to correctly answer 80% of 20 questions on https://gremlin.coassemble.com/

  1. Use the email link to set up a forever-free individual account (paid plans otherwise run $750/month):

    https://www.gremlin.com/gremlin-free-software/?ref=blog https://www.gremlin.com/get-started/?ref=nav

  2. https://www.gremlin.com/community/ get a Slack invite.

  3. https://app.gremlin.com/login

CNCF Litmus & ChaosNative

LitmusChaos was originally developed for use on Kubernetes.

VIDEO: Introduction to Litmus Chaos | Rawkode Live

VIDEO: Karthik S., the maintainer of Litmus Chaos.

Documentation is at https://litmusdocs-beta.netlify.app/docs/introduction/

Gremlin’s approach

Roles for “Game Day”

  1. PROTIP: Hold a “Game Day” to replicate SEV and confirm fix is reliable:

    • General (IMOC = Incident Manager On Call) who defines the schedule and decides on abort conditions.

    • TLOC (Tech Lead On Call) stays focused on technical problem solving.

    • Commander who implements and executes experiments.

    • Scribe who records experiments and results.

    • Observer who correlates results.

    Failure Modes

  2. Review previous RCA (Root Cause Analysis) aka Known Failure Modes to define attack scenarios.

    NOTE: Gremlin’s unique value proposition is that it can turn incident reproduction results into automated scenarios Gremlin can run.

  3. Target one of your services to impose failure modes:

    • K8s Containers Orchestration
    • AWS Cloud Compute
    • Datadog monitoring
    • Messaging
    • Databases
    • ALFI (Application-Level Failure Injection), such as on AWS RDS (VIDEO)

    NOTE: Gremlin provides several “scenarios” to impose “chaos”:

    • Inbound HTTP Traffic
    • Outbound HTTP Traffic

    NOTE: If you are running on Azure and have failover to another availability zone or region (GZRS), Microsoft takes care of the failover process so you shouldn’t even notice it occurred.

  4. Identify a Linux or Windows server where Gremlin can be installed:

    • Ubuntu, Debian
    • CentOS
    • RHEL
    • Docker image
    • Helm
    • Windows

  5. Add Gremlin in server build process. On Windows:

    msiexec /package https://windows.gremlin.com/installer/latest/gremlin_installer.msi
  6. Enable monitoring to measure latency, resource usage

    • CPU usage
    • Memory RAM usage
    • Disk space usage
    • Disk I/O
    • Network packet loss (simulate bandwidth limitation)

    PROTIP: Gremlin is able to target a specific number of CPU cores.

  7. Set alerts to be sent via email, Slack, SMS text, etc.

  8. Set daily, weekly, monthly, annual statistical reports to be sent to a distribution list.

  9. Choose attack mode:


    Resource:
    • CPU usage
    • Memory RAM usage
    • Disk space usage
    • Disk I/O

    State:
    • Kill Process
    • Shutdown
    • Change System time (Time Travel)

    Network:
    • Drop traffic (Blackhole)
    • DNS
    • Latency
    • Packet Loss

  10. Gremlin creates traffic on the network from a Redis in-memory database.

  11. Enable monitoring and alerts. Specifically, analyze latency in transactions going through the network.

    Example result: as Gremlin increases load, it typically sees levels such as:

    1. At 50 ms, the system has enough memory to absorb higher loads without degradation.

    2. At 100 ms, requests begin to be queued, so response times reflect time in queue.

    3. At 300 ms, requests cannot be processed and responses reflect the handling of failed transactions.

    PROTIP: One purpose of this work is to validate monitoring configurations and the ability of monitors to identify those different levels, because different actions are appropriate at each level.
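Those levels can be encoded as monitoring thresholds. A sketch with illustrative cutoffs (tune them to your own Gremlin runs):

```python
def latency_level(p99_ms):
    """Map observed p99 latency (ms) to an operational state.
    Thresholds are illustrative, matching the example levels above."""
    if p99_ms < 100:
        return "absorbing"   # headroom remains; no action needed
    if p99_ms < 300:
        return "queueing"    # requests wait in queue; consider scaling out
    return "failing"         # requests fail; shed load and page on-call

print(latency_level(50), latency_level(150), latency_level(400))
```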

  12. Adjust monitoring and alert levels based on Gremlin runs.

    • Adjust thresholds for alerts

    • Adjust frequency of measurement recording

  13. Run Gremlin to ensure that on-call personnel respond appropriately.

    PROTIP: Measure the actual (upgraded) MTTD & MTTR (Mean Time to Detect and Repair) - How long did it take for the interruption to be detected and then repaired?

  14. Adjust report distribution lists over time automatically, if possible.


  • https://groups.google.com/g/chaos-community/c/84VOWoDQiIg






More on DevSecOps

This is one of a series on DevSecOps:

  1. DevOps_2.0
  2. ci-cd (Continuous Integration and Continuous Delivery)
  3. User Stories for DevOps
  4. Enterprise Software

  5. Git and GitHub vs File Archival
  6. Git Commands and Statuses
  7. Git Commit, Tag, Push
  8. Git Utilities
  9. Data Security GitHub
  10. GitHub API
  11. TFS vs. GitHub

  12. Choices for DevOps Technologies
  13. Pulumi Infrastructure as Code (IaC)
  14. Java DevOps Workflow
  15. Okta for SSO & MFA

  16. AWS DevOps (CodeCommit, CodePipeline, CodeDeploy)
  17. AWS server deployment options
  18. AWS Load Balancers

  19. Cloud services comparisons (across vendors)
  20. Cloud regions (across vendors)
  21. AWS Virtual Private Cloud

  22. Azure Cloud Onramp (Subscriptions, Portal GUI, CLI)
  23. Azure Certifications
  24. Azure Cloud

  25. Azure Cloud Powershell
  26. Bash Windows using Microsoft’s WSL (Windows Subsystem for Linux)
  27. Azure KSQL (Kusto Query Language) for Azure Monitor, etc.

  28. Azure Networking
  29. Azure Storage
  30. Azure Compute
  31. Azure Monitoring

  32. Digital Ocean
  33. Cloud Foundry

  34. Packer automation to build Vagrant images
  35. Terraform multi-cloud provisioning automation
  36. Hashicorp Vault and Consul to generate and hold secrets

  37. Powershell Ecosystem
  38. Powershell on MacOS
  39. Powershell Desired System Configuration

  40. Jenkins Server Setup
  41. Jenkins Plug-ins
  42. Jenkins Freestyle jobs
  43. Jenkins2 Pipeline jobs using Groovy code in Jenkinsfile

  44. Docker (Glossary, Ecosystem, Certification)
  45. Make Makefile for Docker
  46. Docker Setup and run Bash shell script
  47. Bash coding
  48. Docker Setup
  49. Dockerize apps
  50. Docker Registry

  51. Maven on MacOSX

  52. Ansible
  53. Kubernetes Operators
  54. OPA (Open Policy Agent) in Rego language

  55. MySQL Setup

  56. Threat Modeling
  57. SonarQube & SonarSource static code scan

  58. API Management Microsoft
  59. API Management Amazon

  60. Scenarios for load
  61. Chaos Engineering