Here we maintain assets SREs (System Reliability Engineers) use to fearlessly and adroitly face production.
Overview
“The best and fastest way to learn a language is to live like a native”.
PROTIP: To really learn a system, troubleshoot a running production system, as an SRE (System Reliability Engineer).
Most who hold SRE jobs today got them by accident: by being in the right place at the right time. SREs did not say when they were 10 years old that they want to grow up to be an SRE, the way kids say they want to be an astronaut.
The job title “SRE” has existed only a few years. It evolved from “System Administrator”. Google popularized the term “Site Reliability Engineering” in its book at https://sre.google/sre-book/table-of-contents/
We think landing in the role by accident is not an optimal way to equip people for one of the most important jobs in a company. SREs need deep and wide knowledge and skills to keep websites and other critical systems safe and alive, because nearly every organization today depends on them to operate.
We think that great SREs are built, not born.
We think what’s needed is people collaborating to advance a baseline with variations.
NOTE: Content here is my personal opinion, and is not intended to represent any employer (past or present). “PROTIP:” here highlights information I haven’t seen elsewhere on the internet because it consists of hard-won, little-known but significant facts based on my personal research and experience.
SRE Job Descriptions (CV)
Typical job descriptions request people with a number of years of experience.
But we think years of experience is a flawed metric.
What matters more is the progression from newbie to Junior to Senior to Master.
Your Journey to SRE Maturity
The vitality and security of your production system at work reflects your diligence at learning these stages:
- Master basic skills: self-control (to make time), touch typing/vim, VSCode, macOS/Linux commands (sed, awk, jq, jsonnet, etc.), Shell & Python scripting, Git and GitHub, GitHub Markdown, CI/CD (GitHub Actions?), Docker, Terraform, Ansible, Helm, Kubernetes. Then there’s effective collaboration: applying etiquette and tricks to using email, Slack, SMS, Zoom/Teams, etc.
- Customize the adoption plan templates here about how to introduce and sustain the entire implementation lifecycle.
- Study the baseline configuration assets (Terraform, Policies, GitHub Actions scripts, etc.) by reading and viewing videos.
- Use automation to stand up baseline production instances, using the assets and steps described here. (A production system includes observability, dashboards, alerts.)
- Trace events during baseline functional and security tests to ensure that systems continuously adhere to all policies.
- Analyze results against the baseline from scalability commands and runs simulating traffic (starting with a small rig), for Observability history and Chaos Engineering.
- Ensure compatibility when making modifications, given the various new releases coming out all the time.
- Conduct experiments adding components, variations and Chaos Engineering. Break something and see how quickly you can fix it (as measured by MTTR/RTO/RPO, etc.). We have contests.
- Create tutorials for others. Mentor others.
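To make the "break something and see how quickly you can fix it" stage measurable, here is a minimal Python sketch of how MTTR (Mean Time To Recover) could be computed from a drill log. The function name `mean_time_to_recover` and the sample timestamps are hypothetical, made up for illustration. A real rig would pull detection and resolution times from its alerting or incident tooling.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average (resolved - detected) duration over a list of incidents.

    Each incident is a (detected_at, resolved_at) pair of datetimes.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so the sum stays a timedelta
    )
    return total / len(incidents)


# Hypothetical chaos-drill log: (detected, resolved) for each break/fix round.
drills = [
    (datetime(2022, 5, 1, 9, 0), datetime(2022, 5, 1, 9, 30)),     # 30 minutes
    (datetime(2022, 5, 8, 14, 0), datetime(2022, 5, 8, 14, 10)),   # 10 minutes
    (datetime(2022, 5, 15, 11, 0), datetime(2022, 5, 15, 11, 50)), # 50 minutes
]

mttr = mean_time_to_recover(drills)
print(mttr)  # 0:30:00
```

Tracking this number across drills gives a simple trend line for contests: the same break should take less time to fix on each attempt.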
Choice of AWS Region
“Friends don’t let friends use region us-east-1 in production”
AWS typically introduces new services in the us-east-1 (Northern Virginia) region.
Thus, that region has suffered more outages than others.
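One way to turn that advice into a habit is a pre-deployment check that refuses a discouraged region for production. The Python sketch below is only an illustration: the names `DISCOURAGED_PRODUCTION_REGIONS` and `check_region` and the `"production"` environment label are assumptions, not part of any AWS tool. The rule it encodes is the opinion quoted above, not an AWS requirement.

```python
# Regions this team has chosen to avoid for production (an opinion, per the quote above).
DISCOURAGED_PRODUCTION_REGIONS = {"us-east-1"}


def check_region(region, environment):
    """Return the region if acceptable, else raise ValueError.

    Only production deployments are restricted; other environments pass through.
    """
    if environment == "production" and region in DISCOURAGED_PRODUCTION_REGIONS:
        raise ValueError(
            f"{region} is discouraged for production; pick another region"
        )
    return region


print(check_region("us-west-2", "production"))  # us-west-2
print(check_region("us-east-1", "dev"))         # us-east-1
```

A check like this could run at the start of a CI/CD pipeline, before any infrastructure tooling is invoked.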
HashiCorp HashiCups demo rig
https://github.com/hashicorp/consul-k8s-prometheus-grafana-hashicups-demoapp from Sep 2020 (by Derek Strickland) contains application and dashboard definitions for the “Consul Layer 7 observability with Kubernetes” guide located at learn.hashicorp.com.
It leverages micro-services and Consul Service Mesh to connect them all together.
It uses HashiCups, one of the standard HashiCorp demo apps.
Code to create the HashiCups app is from https://github.com/hashicorp-demoapp :
- https://github.com/hashicorp-demoapp/frontend
- https://github.com/hashicorp-demoapp/payments
- https://github.com/hashicorp-demoapp/postgres
Also the infrastructure:
- https://hub.docker.com/repository/docker/hashicorpdemoapp/traffic-simulation
- https://github.com/hashicorp-demoapp/traffic-simulation by Nicholas Jackson
https://github.com/hashicorp/learn-consul-k8s-hashicups
https://github.com/hashicorp/field-demo-hashicups-sample
https://learn.hashicorp.com/tutorials/terraform/provider-setup
https://learn.hashicorp.com/tutorials/consul/kubernetes-deployment-guide
https://learn.hashicorp.com/collections/consul/kubernetes-production
https://github.com/hashicorp/terraform-provider-hashicups
https://github.com/hashicorp/learn-terraform-hashicups-provider
Terraform CDK
Production use often raises the need to handle more complexity.
So users of the declarative HashiCorp Configuration Language (HCL) eventually reach a point where loops, switch statements, and other complex logic would be useful.
Want to leverage the power of your existing toolchain for testing, dependency management, etc.?
https://www.terraform.io/cdktf
HashiCorp’s CDKTF (Cloud Development Kit for Terraform) lets you define and provision infrastructure in several general-purpose programming languages:
- TypeScript, Python, Java, C#, and Go (experimental)
CDKTF was in beta as of May 2022.
CDKTF provides access to the entire Terraform ecosystem, without coding HashiCorp Configuration Language (HCL).
https://www.terraform.io/cdktf/examples
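To illustrate why loops in a general-purpose language help, here is a minimal Python sketch that builds a Terraform configuration in Terraform's JSON syntax using an ordinary `for` loop. It deliberately does not use the cdktf library, so it is a stand-in for the idea rather than CDKTF's actual API. The function name `build_config`, the environment names, and the bucket names are hypothetical.

```python
import json


def build_config(environments):
    """Build a Terraform JSON-syntax config with one S3 bucket per environment."""
    buckets = {}
    for env in environments:
        # One resource block per environment, generated by a plain loop.
        # Bucket names are placeholders; real S3 bucket names must be globally unique.
        buckets[f"logs_{env}"] = {"bucket": f"example-logs-{env}"}
    return {"resource": {"aws_s3_bucket": buckets}}


config = build_config(["dev", "staging", "prod"])

# Written to a file ending in .tf.json, this could be read by Terraform
# in place of the equivalent HCL resource blocks.
print(json.dumps(config, indent=2))
```

With CDKTF itself, the same loop would construct typed resource objects instead of dictionaries, and the toolkit would synthesize the JSON for Terraform to apply.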
CDKTF competes with Pulumi.
https://www.hashicorp.com/blog/managing-hashicorp-consul-access-control-lists-with-terraform-and-vault
https://learn.hashicorp.com/tutorials/vault/production-hardening