Toward Observability

Overview

“The database is a cache of the logs.”

Different vendors have varying definitions of “Observability”.

Lifetime of system

QUESTION [1] - What is the expected lifetime of this system? (based on existing systems in the organization)

Example: 10 years * 50 weeks per year = 500 weeks

500 weeks * views per week (per user) = Lifetime views per user

Usage Demographics

QUESTION [2] - Over the lifetime of this system, about how often (per week) do measurements of the system need to be considered by each persona in the company?

Example for a hypothetical 1,000-user scenario, where % Each is the percentage of total users grouped into that persona, and # Lifetime Views is each user's weekly view rate multiplied by the 500-week lifetime.

Persona      % Each   # Users   #/Week   # Lifetime Views [1]   Notes
Developers     50%       500      30            15,000            -
Operations     15%       150      70            35,000            -
Security       17%       170      60            30,000            -
Managers       10%       100      15             7,500            -
Auditors        2%        20       2             1,000            -
Others          6%        60       5             2,500            -
Total         100%     1,000     182            91,000            -

The number of Lifetime accesses is used to understand how much history will be stored.
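
A minimal sketch of the arithmetic above, using the per-persona weekly rates from the table and the 500-week lifetime:

    # Sketch: per-user lifetime views = views per week * system lifetime in weeks.
    LIFETIME_WEEKS = 10 * 50  # 10 years * 50 weeks per year = 500 weeks

    views_per_week = {
        "Developers": 30,
        "Operations": 70,
        "Security": 60,
        "Managers": 15,
        "Auditors": 2,
        "Others": 5,
    }

    for persona, per_week in views_per_week.items():
        print(f"{persona:<11} {per_week * LIFETIME_WEEKS:>7,} lifetime views per user")

    total_per_week = sum(views_per_week.values())  # 182
    print(f"{'Total':<11} {total_per_week * LIFETIME_WEEKS:>7,} lifetime views per user")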

QUESTION [_] - What is the corporate data retention policy which specifies how long historical data is maintained?

Expectations

QUESTION [3] - What expectations for Availability (Uptime) does each persona have of the system?

Persona      Uptime % Expectation
Developers   95% +
Operations   99% +
Security     99% +
Managers     80% +
Auditors     99% +
Others       85% +
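
To make these expectations concrete, a minimal sketch that converts each uptime percentage into allowed downtime per year (percentages taken from the table above):

    # Sketch: allowed downtime per year = total hours per year * (100 - uptime %) / 100.
    HOURS_PER_YEAR = 365 * 24

    uptime_expectations = {
        "Developers": 95, "Operations": 99, "Security": 99,
        "Managers": 80, "Auditors": 99, "Others": 85,
    }

    for persona, uptime_pct in uptime_expectations.items():
        allowed_downtime = HOURS_PER_YEAR * (100 - uptime_pct) / 100
        print(f"{persona:<11} {uptime_pct}% uptime allows {allowed_downtime:,.1f} hours of downtime per year")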



Measurements

QUESTION [4] - What monitoring is desired?

Examples of usage volume statistics:

  • Cloud Regions
  • Cloud resource components (Nodes, ELB)
  • Files

  • User logins
  • User Groups
  • Number of accesses by each persona over time [2]
  • Rate of users abandoning workflows

Examples of limits imposed by cloud vendors:

  • App executable size (in bytes) on disk and when loaded into RAM
  • Number of endpoints to the AWS API Gateway REST API service - errors occur after 300 endpoints.
  • Response time of a Lambda function behind API Gateway - if it exceeds the 29-second integration timeout, a user of your service receives an HTTP 504 error even though that Lambda invocation is still running in the background.
  • The AWS API Gateway can have 10,000 (RPS limit) x 29 seconds (timeout limit) = 290,000 open connections, as shown in the sketch after this list. Growth beyond that results in HTTP 429 “too many requests” errors.
  • There are many other AWS service quotas.
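
A sketch of the open-connection arithmetic cited in the list above (10,000 RPS and the 29-second timeout are the limits mentioned there):

    # Sketch: maximum open connections = requests-per-second limit * integration timeout.
    RPS_LIMIT = 10_000        # API Gateway requests-per-second limit
    TIMEOUT_SECONDS = 29      # maximum integration timeout, in seconds

    max_open_connections = RPS_LIMIT * TIMEOUT_SECONDS
    print(f"Up to {max_open_connections:,} connections can be open at once")  # 290,000
    # Requests arriving beyond this point are rejected with HTTP 429 "Too Many Requests".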

Example of “shallow” incident management metrics:

  • Number of incidents to investigate
  • TTD (Time To Detect) an issue
  • TTR (Time to Remediate/Repair) an issue - see the sketch after this list
  • Frequency of incidents
  • Severity level of events
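
A sketch of how TTD and TTR could be computed from incident records (the timestamps and field names here are hypothetical):

    # Sketch: Time To Detect (TTD) and Time To Remediate (TTR) per incident.
    from datetime import datetime

    incidents = [  # hypothetical incident records
        {"occurred": "2023-03-01T02:00", "detected": "2023-03-01T02:45", "resolved": "2023-03-01T05:30"},
        {"occurred": "2023-03-07T14:10", "detected": "2023-03-07T14:20", "resolved": "2023-03-07T15:00"},
    ]

    for number, incident in enumerate(incidents, start=1):
        occurred = datetime.fromisoformat(incident["occurred"])
        detected = datetime.fromisoformat(incident["detected"])
        resolved = datetime.fromisoformat(incident["resolved"])
        print(f"Incident {number}: TTD={detected - occurred}, TTR={resolved - detected}")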

Examples of computing resource usage statistics (at various points in time):

  • CPU
  • Memory
  • Storage usage
  • Disk I/O
  • Network I/O

PROTIP: Monitoring queues enables identification of bottlenecks which delay response time. Capacity bottlenecks are not revealed until queues build up under additional simultaneous users during load tests.
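
A minimal sketch of sampling these resource statistics at a point in time, assuming the third-party psutil package is installed; the work-queue depth shown is a hypothetical application-level metric:

    # Sketch: point-in-time samples of CPU, memory, storage, disk I/O, and network I/O.
    import psutil  # pip install psutil (assumed)

    print("CPU %:        ", psutil.cpu_percent(interval=1))
    print("Memory %:     ", psutil.virtual_memory().percent)
    print("Disk usage %: ", psutil.disk_usage("/").percent)
    print("Disk I/O:     ", psutil.disk_io_counters())
    print("Network I/O:  ", psutil.net_io_counters())

    # Queue depth must come from the application or message broker itself;
    # a sustained rise in this number during load tests reveals a capacity bottleneck.
    work_queue_depth = 0  # hypothetical placeholder
    print("Queue depth:  ", work_queue_depth)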

Examples of app-specific statistics, for Consul (queried in the sketch after this list):

  • KV count
  • Intentions
  • ACLs
  • etc.
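
A sketch of pulling such statistics from Consul's HTTP API, assuming a local agent at http://localhost:8500 and ACLs that permit these reads:

    # Sketch: count KV entries and intentions via the Consul HTTP API (standard library only).
    import json
    import urllib.request

    CONSUL = "http://localhost:8500"  # assumed local agent address

    def consul_get(path):
        with urllib.request.urlopen(CONSUL + path) as response:
            return json.load(response)

    kv_keys = consul_get("/v1/kv/?keys")              # all KV keys (a 404 means none exist yet)
    intentions = consul_get("/v1/connect/intentions")

    print("KV count:         ", len(kv_keys))
    print("Intentions count: ", len(intentions))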

Unit Cost measurements provide data for business-level assessments when costs are measured against outputs (see the sketch after this list):

  • What does it cost to process each request/transaction, by type of process?
  • How does the cost of a new request differ from that of an existing request?
  • What is the cost per change request?
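
A sketch of the unit-cost arithmetic, using entirely hypothetical monthly figures:

    # Sketch: unit cost = cost attributed to a process type / number of requests of that type.
    monthly_cost = {"new request": 4_000.00, "change request": 6_000.00}   # hypothetical dollars
    monthly_volume = {"new request": 800, "change request": 2_400}         # hypothetical counts

    for process_type, cost in monthly_cost.items():
        unit_cost = cost / monthly_volume[process_type]
        print(f"{process_type}: ${unit_cost:.2f} per request")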

Productivity measurements reveal improvements in terms of speed and effort.

  • Cycle time – How long does it take to complete a unit of work, in total, and by phase? (see the sketch after this list)
    • Break it out by new vs. maintenance work (most will be maintenance)
    • Average time to create new document/message
    • Average time to change existing document/message

  • Dev. “Productivity” - Requests completed per person – How many new and/or change requests are completed per person per period?
    • Requests by new vs. maintenance work (most will be maintenance)
    • Average change requests completed per person
    • Average new requests completed per person

  • Dev. Velocity - How many content edit cycles (iterations) are needed? This can have a significant impact on cycle time and overall productivity.
    • Iterations per new release
    • Iterations per maintenance release

    PROTIP: Less churn and fewer resulting delays occur in firms with alignment on standards, customer experience objectives, and the roles responsible for them.
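
A sketch of computing cycle time, in total and by phase, from work-item timestamps (the dates and phase names are hypothetical):

    # Sketch: cycle time per phase and in total for one unit of work.
    from datetime import date

    phases = [  # hypothetical phase boundaries for one change request
        ("Analysis", date(2023, 4, 3),  date(2023, 4, 5)),
        ("Build",    date(2023, 4, 5),  date(2023, 4, 12)),
        ("Review",   date(2023, 4, 12), date(2023, 4, 14)),
        ("Release",  date(2023, 4, 14), date(2023, 4, 17)),
    ]

    total_days = 0
    for name, start, end in phases:
        days = (end - start).days
        total_days += days
        print(f"{name:<9} {days} days")
    print(f"Total cycle time: {total_days} days")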

Levels of monitoring investment

The discussion below traverses these four levels of monitoring maturity, from least to greatest investment:

Level 1 - Shallow Individual Component Monitoring
  • Investment (TCO): Minimal [5]
  • Approach: Reactive collections
  • Mechanism: Each component monitored using a different, independent approach
  • Deficiency: Extensive and frequent manual investigation [6]
  • Achievable Uptime: < 65% [6]

Level 2 - In-Depth Monitoring on Different Levels
  • Investment (TCO): ~3X
  • Approach: In-depth trends
  • Mechanism: Dashboards display the various components of each specific service
  • Deficiency: Issues impacting the entire IT stack not apparent
  • Achievable Uptime: < 85%

Level 3 - Next Generation Monitoring
  • Investment (TCO): ~10X
  • Approach: Responsive alerts
  • Mechanism: Interdependence among components becomes apparent when a specific component is changed, for manual RCA (Root Cause Analysis)
  • Deficiency: Manual remediation necessary
  • Achievable Uptime: < 90.0%

Level 4 - Observability
  • Investment (TCO): ~20X
  • Approach: Proactive self-healing
  • Mechanism: Anomalous behavior automatically identified for automated RCA and remediation, using Machine Learning (AI), before a crisis
  • Deficiency: Human oversight and tuning still needed
  • Achievable Uptime: < 99.5%

Minimal investment

QUESTION [5] - What do you consider to be the minimal investment in monitoring infrastructure?

Level 0 - Operating System Defaults

Level 0 is to use only commands provided by default by the operating system being used.
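
A sketch of Level 0, assuming a Linux host where these default commands are available:

    # Sketch: collect point-in-time measurements using only default OS commands.
    import subprocess

    for command in (["uptime"], ["free", "-m"], ["df", "-h", "/"], ["vmstat", "1", "2"]):
        print("$ " + " ".join(command))
        print(subprocess.run(command, capture_output=True, text=True).stdout)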

Level 1 - Reactive Monitoring

Level 1 is to install FOSS monitoring utilities such as Prometheus with Grafana dashboards or Elastic Stack.
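
A minimal sketch of exposing metrics for a Prometheus server to scrape, assuming the prometheus_client package is installed; the metric names and port are hypothetical:

    # Sketch: expose a Counter and a Gauge at /metrics for Prometheus to scrape.
    import random
    import time

    from prometheus_client import Counter, Gauge, start_http_server

    REQUESTS = Counter("app_requests_total", "Total requests handled")
    QUEUE_DEPTH = Gauge("app_queue_depth", "Items waiting in the work queue")

    if __name__ == "__main__":
        start_http_server(8000)  # metrics served at http://localhost:8000/metrics
        while True:
            REQUESTS.inc()                          # simulate handling one request
            QUEUE_DEPTH.set(random.randint(0, 50))  # simulate a fluctuating queue
            time.sleep(1)

Once Prometheus is configured to scrape that endpoint, a Grafana dashboard can graph the resulting series.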

    PROTIP: Instead of having each app team figure out how to install, configure, and manage monitoring, many enterprises staff a monitoring team to provide quick-start expertise and reduce the time and frustration experienced by each team.

    [6] A "reactive" monitoring system means that significant issues may NOT be noticed until users complain (high TTD and low user satisfaction). Tracing of issues encountered by a specific user is not possible at this level. High TTR is likely because success at incident response depends on lucky guesses every time there is an issue. Troubleshooting is needed for every incident because issues are investigated after the fact rather than prevented. This approach also depends on long staff tenure, which is unlikely since turnover is usually high under such working conditions. So uptime would be low, especially if investments in backup and restore are also minimal.

Level 2 is to install licensed monitoring utilities such as Datadog, Splunk, New Relic, Honeycomb, etc.

    NOTE: Some vendors (such as SolarWinds) specialize in network monitoring based on SNMP (the Simple Network Management Protocol). However, more and more data centers disable SNMP to improve security against external hacking. See also https://learn.hashicorp.com/tutorials/terraform/tfe-log-forwarding?in=terraform/recommended-patterns

Level 3 is to add an enterprise-wide, end-to-end “capabilities dashboard”, plus tracing of a single user's impact across components, plus PagerDuty or xMatters for smart alerting/incident management.

    The time that a specific operation (such as an HTTP request or a database call) takes within a component is recorded as a "span". Each span is associated with attributes. Such instrumentation is done by functions in the open-source OpenTelemetry (OTel) library for each programming language and framework (such as Flask for Python). Spans are displayed in the program's console (STDOUT) and also exported to an application performance monitoring utility such as Datadog, which correlates the various spans on a visual dashboard. Spans can be nested, in a parent-child relationship with other spans. This aggregation of spans into end-to-end traces is done by a "distributed tracing backend" such as Jaeger. See https://open-telemetry.github.io/opentelemetry-python/getting-started.html
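
    A minimal sketch of nested (parent-child) spans using the OpenTelemetry Python SDK with the console exporter, as in the getting-started guide linked above; the span names and attributes are illustrative:

        # Sketch: a parent span for an HTTP request with a nested child span for a database call,
        # exported to the console (STDOUT).
        # pip install opentelemetry-api opentelemetry-sdk  (assumed)
        from opentelemetry import trace
        from opentelemetry.sdk.trace import TracerProvider
        from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

        provider = TracerProvider()
        provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
        trace.set_tracer_provider(provider)
        tracer = trace.get_tracer(__name__)

        with tracer.start_as_current_span("GET /orders") as parent:       # parent span
            parent.set_attribute("http.method", "GET")
            with tracer.start_as_current_span("db.query") as child:       # child (nested) span
                child.set_attribute("db.statement", "SELECT * FROM orders")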

    The React-based Heimdall Application Dashboard (instead of Grafana) shows links to monitoring data collected into Prometheus. VIDEO: Heimdall at FT

Level 4 is to install Machine-Learning (AI) based “Observability” systems such as InsightFinder, StackState, etc.

    Instrumentation is necessary to determine whether systems are able to achieve scalability and availability objectives (and when they are not). Measurable reliability goals enable people to know when to stop tuning efforts. However, there are financial and operational tradeoffs for adopting each level of instrumentation. It is convenient and less stressful when systems are self-healing. But can the organization afford the time and expense to achieve that convenience? One can over-invest in achieving the ultimate level of observability. Over-investment in monitoring drains focus from feature development work. On the other hand, under-investment in monitoring can reduce development teams to “spinning their wheels” on guesses that waste precious time and create frustration which leads to turnover.

Dashboards

Resources

[1] https://f.hubspotusercontent20.net/hubfs/516677/monitoring-maturity-model-2020.pdf

https://www.hashicorp.com/blog/improve-observability-with-opentelemetry-and-consul-service-mesh