Collect metrics (for visualization by Grafana), analyze them with PromQL queries, and trigger alerts, all with free software from the CNCF, especially for Kubernetes
Overview
The name Prometheus comes from Greek mythology. The Titan Prometheus was an immortal servant of the gods, who stole fire and gave it to humankind. This changed the human race forever (for better and worse). But this made mankind dangerous to the gods. Ridley Scott named his 2012 film “Prometheus”, saying: “It’s the story of creation; the gods and the man who stood against them.”
Unlike the legacy “statsd” daemon, which is concerned only with system-level metrics such as CPU and memory, the tool Prometheus (at https://prometheus.io) gathers metrics from targets at the cluster, node, and microservice API levels.
The Prometheus server pulls, or “scrapes” (gathers), metrics from target hosts and applications, which expose metrics through instrumented job exporters or other custom metric providers, either directly or via an intermediary push gateway for short-lived jobs.
In addition to static configurations, Prometheus can also discover targets to monitor through its service discovery mechanisms.
Prometheus stores scraped samples locally in its own multi-dimensional numeric time-series database. Unlike central data collectors (such as Splunk), each Prometheus server runs standalone and is thus not dependent on network storage or other remote services, so it remains available even when other parts of the infrastructure are broken.
Recording rules running against the Prometheus database aggregate existing data and record the results as new time series.
Prometheus provides multiple modes of graphing and dashboarding support, but also exposes its time-series data to API clients such as Grafana, which use PromQL (Prometheus Query Language) to extract data for display as visualizations on their dashboards.
Because people can’t always be watching such screens, alerting rules are also set in Prometheus to trigger alerts that are pushed to the Alertmanager, which notifies endpoints such as email, Slack, PagerDuty, SMS, or other notification mechanisms.
PCA Exam
The $250 90-minute Prometheus Certified Associate (PCA) exam at https://training.linuxfoundation.org/certification/prometheus-certified-associate/ is based on the PaC (Project Forethought) application, which is a simple to-do list program written in Node.js. It is Dockerized and deployed to a virtual machine. The application is instrumented with Prometheus client libraries to track metrics across the app. The exam’s domains:
18% Observability Concepts
- Metrics
- Understand logs and events
- Tracing and Spans
- Push vs Pull
- Service Discovery
- Basics of SLOs, SLAs, and SLIs
20% Prometheus Fundamentals
- System Architecture
- Configuration and Scraping
- Understanding Prometheus Limitations
- Data Model and Labels
- Exposition Format
28% PromQL
- Selecting Data
- Rates and Derivatives
- Aggregating over time
- Aggregating over dimensions
- Binary operators
- Histograms
- Timestamp Metrics
16% Instrumentation and Exporters
- Client Libraries
- Instrumentation
- Exporters
- Structuring and naming metrics
18% Alerting & Dashboarding
- Dashboarding basics
- Configuring Alerting rules
- Understand and Use Alertmanager
- Alerting basics (when, what, and why)
A PCA digital credential ensures the candidate understands how to use observability data to improve application performance, troubleshoot system implementations, and feed that data into other systems.
Sample app
The $299 course “Monitoring Infrastructure and Containers with Prometheus” (LFS241) uses the PaC (Project Forethought) application, which is a simple to-do list program written in Node.js. It is Dockerized and deployed to a virtual machine. The application is instrumented with Prometheus client libraries to track metrics across the app.
- Course Introduction
- Introduction to Systems and Service Monitoring
- Introduction to Prometheus
- Installing and Setting Up Prometheus
- Basic Querying
- Dashboarding
- Monitoring Host Metrics
- Monitoring Container Metrics
- Instrumenting Code
- Building Exporters
- Advanced Querying
- Relabeling
- Service Discovery
- Blackbox Monitoring
- Pushing Data
- Alerting
- Making Prometheus Highly Available
- Recording Rules
- Scaling Prometheus Deployments
- Prometheus and Kubernetes
- Local Storage
- Remote Storage Integrations
- Transitioning From and Integrating with Other Monitoring Systems
- Monitoring and Debugging Prometheus
Learning Environment
The “DevOps Monitoring Deep Dive” video course by Elle Krout references an interactive Lucid diagram called “ProjectForethought” for Forethought, the simple Node.js to-do list program that is the subject of monitoring.
- Within Linux Academy’s Cloud Servers, create the “DevOps Monitoring Deep Dive” distribution on a small-sized host.
- When “READY”, click the Distribution name “DevOps Monitoring Deep Dive” for details.
- Highlight and copy the Temp. Password by clicking the copy icon.
- Click “Terminal” to open another browser window.
- Type “cloud_user” to login:
- Paste the password.
- For a new password, I paste the password again, but add an additional character.
- Paste it again to confirm.
- When the environment is opened, highlight and copy this command:
bash -c "$(curl -fsSL https://raw.githubusercontent.com/wilsonmar/DevSecOps/master/Prometheus/prometheus-setup.sh)"
- Copy the password to your computer’s Clipboard.
- Switch to the Terminal to paste, which runs the script.
- Paste the password when prompted.
- To rerun the script, discard the current instance and create a new instance.
The script is self-documented, but below are additional comments:
Below is a description of the Docker steps:
- Confirm the creation of the existing Docker image:
docker image list
The response lists “forethought” as a Docker image.
- List the contents of the forethought directory and its subdirectories:
ls -R forethought
- Deploy the web application to a container, mapping port 8080 in the container to port 80 on the host:
docker run --name ft-app -p 80:8080 -d forethought
- Check that the application is working correctly by visiting the server’s provided URL. In the script, this is done by using curl and examining the HTML response, roughly as sketched below.
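A rough equivalent from the command line might look like the following; the port mapping follows the docker run above, and the grep string is only an assumption about the page content:
curl -s http://localhost/ | grep -i "forethought" && echo "App is serving the expected content"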
Install
NOTE: The Terminal is inside a Dockerized Ubuntu (18.04 Bionic Beaver LTS) image, so apt-get commands are used to install Prometheus, Alertmanager, and Grafana. The infrastructure is monitored using Prometheus’s Node Exporter to view statistics about CPU, memory, disk, file system, basic networking, and load metrics. Also monitored are containers being used on virtual machines.
Once infrastructure monitoring is up and running, the basic Node.js application uses a Prometheus client library to track metrics across the app.
Finally, add recording and alerting rules, and build out a series of routes so any alerts created reach their desired endpoints.
The course also looks at creating persistent dashboards with Grafana and using its various graphing options to better track data.
Kubernetes
Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016 as its second hosted project after Kubernetes. So naturally, Prometheus works with K8s. See https://github.com/kayrus/prometheus-kubernetes.
In late 2016, CoreOS introduced the Operator pattern and released an example using that pattern, the Prometheus Operator. It automatically creates, configures, and manages Prometheus monitoring instances in clusters atop Kubernetes. See https://github.com/coreos/prometheus-operator and https://devops.college/prometheus-operator-how-to-monitor-an-external-service-3cb6ac8d5acb
PROTIP: Prometheus packages in apt-get, yum, and brew can lag well behind releases, so they are not recommended at this time for production use. But that hasn’t stopped many from using them in production.
$ cd /tmp
$ wget https://github.com/prometheus/prometheus/releases/download/v2.2.0/prometheus-2.2.0.linux-amd64.tar.gz
$ tar -xzf prometheus-2.2.0.linux-amd64.tar.gz
$ sudo chmod +x prometheus-2.2.0.linux-amd64/{prometheus,promtool}
$ sudo cp prometheus-2.2.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
$ sudo chown root:root /usr/local/bin/{prometheus,promtool}
$ sudo mkdir -p /etc/prometheus
$ sudo vim /etc/prometheus/prometheus.yml
$ promtool check config prometheus.yml
Checking prometheus.yml
  SUCCESS: 0 rule files found
$ prometheus --config.file "/etc/prometheus/prometheus.yml" &
Ansible installer
Paweł Krupa (@paulfantom, author of the Docker Workshop) and Roman Demachkovych (@rdemachkovych), together as Cloud Alchemy, gave a presentation about their Ansible role for Prometheus, with a demo at https://demo.cloudalchemy.org. Its features:
- Zero-configuration deployment
- Easy management of multiple nodes
- Error checking
- Multiple CPU architecture support
- Versioning
- System user management
- CPU architecture auto-detection
- systemd service files
- Linux capabilities support
- Basic SELinux (Security-Enhanced Linux) security module support
https://travis-ci.org/cloudalchemy/demo-site
Starting Prometheus
To run Prometheus after downloading the Docker image from the “prom” account in Dockerhub:
docker run -p 9090:9090 -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
Start Docker and try again if you get this error message:
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
The expected message is:
msg="Server is ready to receive web requests."
The location of the database and the retention period are controlled by command-line options: add --storage.tsdb.path to use another path, and add --storage.tsdb.retention to specify a retention period other than the default 15d (days).
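For example, to run locally with both options overridden (the path shown is only an illustration and must be writable by the user running Prometheus):
./prometheus --config.file=prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention=30d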
- Open a browser to see the log at:
There is no menu item to view the page.
- Open a browser to see the Graph at the URL home page:
The example above shows metrics for the Go language/virtual machine running locally.
NOTE: https://prometheus.io/docs contains the documentation. It notes that in 2012 SoundCloud wrote Prometheus in Go and open-sourced it at https://github.com/prometheus.
Graphing specs
- TODO: Select “go_gc_duration_seconds” for the median, which is the 50th quantile, specified as:
rate(prometheus_tsdb_head_samples_appended_total[1m])
Also:
go_gc_duration_seconds{instance="localhost:9090",job="prometheus",quantile="0.5"}
- Press Execute.
- Click “Graph”.
Notice the granularity of timing on the horizontal axis: thousandths of a second.
Configuring Prometheus.yml
- Open a browser to http://localhost:9090/config
prometheus.yml is the configuration file that contains these blocks: global, rule_files, and scrape_configs. Optionally, there are remote_read, remote_write, alerting.
global:
  evaluation_interval: 15s
  scrape_interval: 15s
  scrape_timeout: 10s
  external_labels:
    environment: localhost.localdomain
In the global block, scrape_interval specifies the frequency (here 15s, seconds) at which Prometheus scrapes targets. (The default is every 60 seconds.)
The evaluation_interval (here 15s) controls how often Prometheus evaluates rule files that specify creation of new time series and generation of alerts.
A distinguishing feature of Prometheus is its rules engine, which generates alerts that are handled by the separately installed Prometheus Alertmanager.
Recording rules precompute frequently needed and expensive expressions and save their results as derived time-series data.
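A minimal sketch of a recording rules file, loaded through rule_files in prometheus.yml; the metric name http_requests_total and the rule name are illustrative, not part of this setup:
groups:
  - name: example-recording-rules
    rules:
      # Precompute the per-job request rate over 5 minutes
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
Prometheus evaluates the expression every evaluation_interval and stores the result as the new series job:http_requests:rate5m.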
Scrape configs
This defines the job that scrapes the Prometheus web UI:
scrape_configs:
  - job_name: 'prometheus'
    metrics_path: "/metrics"
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node
    file_sd_configs:
      - files:
          - "/etc/prometheus/file_sd/node.yml"
Several jobs can be defined in a config: the sample above defines two (prometheus and node), and the full template config file referenced below names several more (service-x, service-y, and service-z).
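The node job above reads its targets from a file. A minimal sketch of what /etc/prometheus/file_sd/node.yml might contain; the target address and label are illustrative:
- targets:
    - "localhost:9100"
  labels:
    env: demo
Prometheus re-reads file_sd files when they change, so targets can be added without restarting the server.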
Local start
Alternatively, run Prometheus locally from a downloaded binary:
PROTIP: Using /etc/prometheus would require sudo, but ~/.prometheus would not.
- Create a folder to hold the Prometheus configuration file, then cd to it:
cd ~ ; mkdir .prometheus ; cd .prometheus
- Create a Prometheus configuration file in the folder, or copy in and then edit the full template example at:
https://github.com/prometheus/prometheus/blob/release-2.3/config/testdata/conf.good.yml
- Validate the YAML syntax with yamllint:
https://github.com/adrienverge/yamllint
- Validate the content using promtool from the Prometheus bin folder:
promtool check config prometheus.yml
An example error message is:
Checking prometheus.yml FAILED: parsing YAML file prometheus.yml: yaml: line 13: did not find expected '-' indicator
The expected response is: “SUCCESS: 0 rule files found”.
- To run Prometheus locally, in the directory containing the Prometheus binary:
./prometheus --config.file=prometheus.yml
Startup then shows log output such as:
level=info ts=2017-10-23T14:03:02.274562Z caller=main.go:216 msg="Starting prometheus"...
Although an Alertmanager is not required to run Prometheus,…
Command
# Ansible managed file. Be wary of possible overwrites.
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
Environment="GOMAXPROCS=1"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention=30d \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=http://demo.cloudalchemy.org:9090
SyslogIdentifier=prometheus
Restart=always

[Install]
WantedBy=multi-user.target
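Assuming the unit file above is saved as /etc/systemd/system/prometheus.service and that the prometheus user, group, and directories it references already exist, it could be activated like this:
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
systemctl status prometheus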
App Metrics
The four golden signals of monitoring are:
- Latency
The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
- Traffic
A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.
To identify bottlenecks, instead of beginning with given metrics (partial answers) and trying to work backwards, the Utilization Saturation and Errors (USE) Method by Brendan Gregg (of Netflix), described at http://www.brendangregg.com/usemethod.html, begins by posing questions off a checklist and then seeks answers. The method directs the construction of a checklist which, for server analysis, can be used to quickly identify resource bottlenecks or errors.
- Utilization
The average time that the resource was busy servicing work.
- Errors
The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.
- Saturation
How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential. In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., “Give me a nonce” or “I need a globally unique monotonic integer”) that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.
Predictive: saturation is the basis for projections of impending issues, such as “at the current rate, your database will fill its hard drive in 4 hours.”
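As hedged PromQL sketches of two of these signals; the metric name http_requests_total and its code label are illustrative and depend on what your instrumentation actually exposes:
# Traffic: overall request rate over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning a 5xx status
sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))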
Exporters
Exporters exist for well-known services: StatsD, Node, AWS CloudWatch, InfluxDB, JMX, SNMP, HAProxy, Consul, Memcached, Graphite, Blackbox, etc. See https://prometheus.io/docs/instrumenting/exporters
The WMI Exporter provides system metrics for Windows servers.
Custom exporters fall into categories such as databases, messaging systems, APIs, logging, storage, hardware, HTTP, etc.
Ports:
- 9100 - Node exporter
- 9101 - HAProxy exporter
- 9102 - StatsD exporter
- 9103 - Collectd exporter
- 9108 - Graphite exporter
- 9110 - Blackbox exporter
Node Exporter
The Prometheus Node Exporter has its own repo at https://github.com/prometheus/node_exporter
To download a release from GitHub:
https://github.com/prometheus/node_exporter/releases
# TODO: Identify latest version URL in https://prometheus.io/download/#node_exporter
# TODO: Code different downloads for Darwin vs. other OS:
wget https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz
# https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.darwin-386.tar.gz
# v0.16.0 is dated 2018-05-15
tar -xzf node_exporter-*
sudo cp node_exporter-*/node_exporter /usr/local/bin/
node_exporter --version
A sample response (at time of writing):
node_exporter, version 0.16.0 (branch: HEAD, revision: 6e2053c557f96efb63aef3691f15335a70baaffd) . . .
The node_exporter runs on port 9100 by default to expose metrics, but this can be changed:
node_exporter --web.listen-address=":9100" \
  --web.telemetry-path="/node_metrics"
And the corresponding scrape config in prometheus.yml:
scrape_configs:
  - job_name: "prometheus"
    metrics_path: "/metrics"
    static_configs:
      - targets:
          - "localhost:9090"
  - job_name: node
    file_sd_configs:
      - files:
          - "/etc/prometheus/file_sd/node.yml"
Metrics exposition format
# HELP http_request_duration_microseconds The HTTP request latencies in microseconds.
# TYPE http_request_duration_microseconds summary
http_request_duration_microseconds{handler="prometheus",quantile="0.5"} 73334.095
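To see this exposition format yourself, fetch the /metrics endpoint of any target (ports as configured earlier on this page):
curl -s http://localhost:9090/metrics | head -n 20    # the Prometheus server's own metrics
curl -s http://localhost:9100/metrics | head -n 20    # node_exporter metrics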
Operator
TBD
Alertmanager
The Alertmanager listens on port 9093 by default.
The Prometheus Alertmanager handles the alerts generated by Prometheus alerting rules.
A sample config:
alerting:
  alertmanagers:
    - scheme: https
      static_configs:
        - targets:
            - "1.2.3.4:9093"
            - "1.2.3.5:9093"
            - "1.2.3.6:9093"
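Prometheus itself evaluates alerting rules and pushes the resulting alerts to those Alertmanager targets. A minimal sketch of an alerting rules file, loaded through rule_files in prometheus.yml; the alert name and threshold are illustrative:
groups:
  - name: example-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0          # the target failed its last scrape
        for: 5m                # must be true for 5 minutes before firing
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"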
Functions:
- routing
- sending
- grouping
- deduplication
- silencing
- inhibition
Under development: a cluster of Alertmanager instances can form a mesh configuration to ensure high availability.
Integrations include:
- hipchat
- pagerduty
- pushover
- slack
- opsgenie
- webhook
- victorops
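A minimal alertmanager.yml sketch with one route and one Slack receiver; the webhook URL and channel are placeholders to be replaced:
route:
  receiver: team-slack
  group_by: ['alertname', 'job']
  group_wait: 30s        # wait before sending the first notification for a group
  group_interval: 5m     # wait before notifying about new alerts added to a group
  repeat_interval: 4h    # wait before re-sending a still-firing alert
receivers:
  - name: team-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME
        channel: '#alerts'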
PromQL Query Language
As noted above, Prometheus exposes its time-series data to API clients such as Grafana, which use PromQL (Prometheus Query Language) to extract data for visualization. PromQL is also used in the built-in expression browser and in recording and alerting rules.
Core metric types in Prometheus:
- Counter - a cumulative value that only increases (such as packets received)
- Gauge - a current value that increases or decreases (such as memory usage)
- Histogram - samples observations (such as request durations) into configurable buckets, along with a sum and a count
- Summary - similar to a histogram, but calculates streaming quantiles on the client side
For example, the 90th-percentile request duration over the last 5 minutes can be computed from histogram buckets:
histogram_quantile(0.90,
  sum without (code, instance) (
    rate(http_request_seconds_bucket[5m])
  )
)
Client libraries
Embed official client libraries (Go, Java or Scala, Python, Ruby):
Unofficial third-party client libraries:
- Bash
- C++
- Common Lisp
- Elixir
- Erlang
- Haskell
- Lua for Nginx
- Lua for Tarantool
- .NET / C#
- node.js prom-client
- PHP
- Rust
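As an illustration of direct instrumentation, here is a minimal sketch using the official Python client (prometheus_client); the metric names, label, and port are illustrative and not taken from the Forethought app:
# pip install prometheus_client
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter('app_requests_total', 'Total requests handled', ['path'])
LATENCY = Histogram('app_request_duration_seconds', 'Request latency in seconds')

@LATENCY.time()                        # observe how long each call takes
def handle_request(path):
    REQUESTS.labels(path=path).inc()   # count the request, labeled by path
    time.sleep(random.random() / 10)   # simulate work

if __name__ == '__main__':
    start_http_server(8000)            # expose /metrics on port 8000
    while True:
        handle_request('/')
Prometheus would then scrape http://<host>:8000/metrics like any other target.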
Video courses
If you have a subscription to OReilly.com, Sander van Vugt has a video course on Kubernetes and Cloud Native Associate (KCNA) published by Pearson IT Certification. He also has a live course 6-10am MT Dec 1 & 2, 2022.
Resources
“Monitoring with Prometheus” (360 pages, at https://prometheusbook.com) is by James Turnbull, who also wrote books about other DevOps tools: Kubernetes, Packer, Terraform, Logstash, Puppet, etc., based on his work as CTO at Kickstarter, VP of Services and Support at Docker, VP of Engineering at Venmo, and VP of Technical Operations at Puppet. The book is hands-on for Prometheus version 2.3.0 on a Linux distribution. The author promises updates even though he is busy as CTO at Empatico. Code for the book is at:
- https://github.com/turnbullpress/prometheusbook-code by the author.
- https://github.com/yunlzheng/prometheus-book is a 3rd-party Chinese translation
Turnbull suggests monitoring services for “correctness”, not just their status, starting with business metrics, then application metrics (https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html#xref_monitoring_golden-signals), then operating-system metrics, to avoid “cargo cult” delusions. An example is monitoring rates of business transactions rather than server uptime.
Brian Brazil blogs about Prometheus at https://www.robustperception.io/blog/ The blog mentions his trainings. He is also working on the O’Reilly (Safari) book “Prometheus: Up & Running”.
paulfantom/workshop-docker
Monitoring, the Prometheus Way May 8, 2017 by Julius Volz - Co-Founder, Prometheus
Infrastructure and application monitoring using Prometheus at Devox UK May 17, 2017 by Marco Pas
LinuxAcademy video hands-on courses:
- Monitoring Infrastructure and Containers with Prometheus: Prometheus is used to monitor infrastructure and applications at multiple levels: on the host itself, on any containers, and on the application. This hands-on lab addresses monitoring of the virtual machine host and containers. It begins by setting up monitoring for a virtual machine using Prometheus’s Node Exporter, then sets up container monitoring for the provided container using Google’s cAdvisor.
View metrics in Prometheus across two levels of a system to track changes and view trends.
- DevOps Monitoring Deep Dive by Elle Krout references an interactive Lucid diagram called “ProjectForethought” for Forethought, the simple Node.js to-do list program that is the subject of monitoring. Within Linux Academy’s Cloud Servers, create the “DevOps Monitoring Deep Dive” distribution on a small-sized host. It contains a Dockerized Ubuntu (18.04 Bionic Beaver LTS), so apt-get commands are used to install Prometheus, Alertmanager, and Grafana, and the app is deployed with docker run --name ft-app -p 80:8080 -d forethought. The course then follows the same infrastructure, container, and application monitoring steps described in the Learning Environment section above.
Other notes
https://timber.io/blog/prometheus-the-good-the-bad-and-the-ugly/
https://eng.uber.com/m3/ Uber open-sourced their M3 Metrics platform for Prometheus in 2018