
Wilson Mar


Collect metrics (for visualization by Grafana), analyze using PromQL coding, and identify alerts, free from CNCF, especially for Kubernetes


Overview

The name Prometheus comes from Greek mythology. The Titan Prometheus was an immortal servant of the gods, who stole fire and gave it to humankind. This changed the human race forever (for better and worse). But this made mankind dangerous to the gods. Ridley Scott named his 2012 film “Prometheus”, saying: “It’s the story of creation; the gods and the man who stood against them.”

Unlike the legacy “statsd” daemon which is concerned only with system-level metrics such as CPU, Memory, etc., the tool Prometheus (at https://prometheus.io) gathers metrics from targets at the cluster, node, and microservice API levels.


Prometheus runs a service that pulls (scrapes) metrics from target hosts and applications. Targets expose metrics through instrumented jobs, exporters, or other custom metric providers, either directly or via an intermediary Pushgateway for short-lived jobs.

In addition to static configurations, Prometheus can also discover targets to monitor through its service discovery.

Prometheus stores scraped samples locally in its own multi-dimensional numeric time-series database. Unlike central data collectors (such as Splunk), each Prometheus server runs standalone and thus does not depend on network storage or other remote services. So it’s available even when other parts of the infrastructure are broken.

Rules running in the Prometheus server either aggregate and record new time series from existing data or generate alerts.

Prometheus provides multiple modes of graphing and dashboarding support, and also exposes its time-series data to API clients such as Grafana, which issue PromQL (Prometheus Query Language) queries to extract data for display in their visualizations.

Because people can’t always be watching such screens, rules are also set in Prometheus to trigger alerts, which are pushed to the Alertmanager. It notifies endpoints such as email, Slack, PagerDuty, SMS, or other notification mechanisms.

PCA Exam

The $250 90-minute Prometheus Certified Associate (PCA) exam at https://training.linuxfoundation.org/certification/prometheus-certified-associate/ is based on the PaC (Project Forethought) application, which is a simple to-do list program written in Node.js. It is Dockerized and deployed to a virtual machine. The application is instrumented with Prometheus client libraries to track metrics across the app. The exam’s domains:

18% Observability Concepts

  • Metrics
  • Understand logs and events
  • Tracing and Spans
  • Push vs Pull
  • Service Discovery
  • Basics of SLOs, SLAs, and SLIs

20% Prometheus Fundamentals

  • System Architecture
  • Configuration and Scraping
  • Understanding Prometheus Limitations
  • Data Model and Labels
  • Exposition Format

28% PromQL

  • Selecting Data
  • Rates and Derivatives
  • Aggregating over time
  • Aggregating over dimensions
  • Binary operators
  • Histograms
  • Timestamp Metrics

16% Instrumentation and Exporters

  • Client Libraries
  • Instrumentation
  • Exporters
  • Structuring and naming metrics

18% Alerting & Dashboarding

  • Dashboarding basics
  • Configuring Alerting rules
  • Understand and Use Alertmanager
  • Alerting basics (when, what, and why)

A PCA digital credential ensures the candidate understands how to use observability data to improve application performance, troubleshoot system implementations, and feed that data into other systems.

Sample app

The $299 course “Monitoring Infrastructure and Containers with Prometheus” (LFS241) uses the PaC (Project Forethought) application, which is a simple to-do list program written in Node.js. It is Dockerized and deployed to a virtual machine. The application is instrumented with Prometheus client libraries to track metrics across the app.

  1. Course Introduction
  2. Introduction to Systems and Service Monitoring
  3. Introduction to Prometheus
  4. Installing and Setting Up Prometheus
  5. Basic Querying
  6. Dashboarding
  7. Monitoring Host Metrics
  8. Monitoring Container Metrics
  9. Instrumenting Code
  10. Building Exporters
  11. Advanced Querying
  12. Relabeling
  13. Service Discovery
  14. Blackbox Monitoring
  15. Pushing Data
  16. Alerting
  17. Making Prometheus Highly Available
  18. Recording Rules
  19. Scaling Prometheus Deployments
  20. Prometheus and Kubernetes
  21. Local Storage
  22. Remote Storage Integrations
  23. Transitioning From and Integrating with Other Monitoring Systems
  24. Monitoring and Debugging Prometheus

Learning Environment

The “DevOps Monitoring Deep Dive” video course by Elle Krout references an interactive Lucid diagram called “ProjectForethought” for the NodeJs simple to-do list program called Forethought that is the subject of monitoring.

  1. Create within Linux Academy’s Servers in the cloud, the “DevOps Monitoring Deep Dive” distribution in a small-sized host.
  2. When “READY”, click the Distribution name “DevOps Monitoring Deep Dive” for details.
  3. Highlight and copy the Temp. Password by clicking the copy icon.
  4. Click “Terminal” to open another browser window.
  5. Type “cloud_user” to login:
  6. Paste the password.
  7. For a new password, I paste the password again, but add an additional character.
  8. Again to confirm.

  9. When an environment is opened, highlight and copy this command:

    bash -c "$(curl -fsSL https://raw.githubusercontent.com/wilsonmar/DevSecOps/master/Prometheus/prometheus-setup.sh)"
  10. Copy the password to your computer’s Clipboard.
  11. Switch to the Terminal to paste, which runs the script.
  12. Paste the password when prompted.

  13. To rerun the script, discard the current instance and create a new instance.

    The script is self-documented, but below are additional comments:

Below is a description of Docker

  1. Confirm the creation of the existing Docker image:

    docker image list

    The response lists “forethought” as a Docker image.

  2. List the contents of the forethought directory and subdirectories:

    ls -R forethought
  3. Deploy the web application to a container. Map port 8080 on the container to port 80 on the host:

    docker run --name ft-app -p 80:8080 -d forethought

  4. Check that the application is working correctly by visiting the server’s provided URL.

    In the script, this is done using a curl script and examining the HTML response.

  5. Install

    NOTE: The Terminal is inside a Dockerized Ubuntu (18.04 Bionic Beaver LTS) image. So apt-get commands are used to install Prometheus, Alertmanager, and Grafana.

    The infrastructure is monitored using Prometheus’s Node Exporter to view statistics about CPU, memory, disk, file system, basic networking, and load metrics. Also monitored are containers being used on virtual machines.

    Once infrastructure monitoring is up and running, the basic Node.js application uses a Prometheus client library to track metrics across the app.

    Finally, add recording and alerting rules, and build out a series of routes so any alerts created get to their desired endpoint.

    The course also looks at creating persistent dashboards with Grafana and using its various graphing options to better track data.

Kubernetes

Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016 as its second hosted project after Kubernetes. So naturally, Prometheus works with K8s. See https://github.com/kayrus/prometheus-kubernetes.

In late 2016, CoreOS introduced the Operator pattern and released an example using that pattern in Prometheus Operator. It automatically creates, configures, and manages Prometheus monitoring instances in clusters atop Kubernetes. See https://github.com/coreos/prometheus-operator and https://devops.college/prometheus-operator-how-to-monitor-an-external-service-3cb6ac8d5acb

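PROTIP: When this tip was first written, Prometheus had not yet reached “1.0”, so apt-get, yum, brew, and installer packages were not recommended for production use. (Version 1.0 was released in July 2016, and the commands below install release 2.2.0 from GitHub.) That didn’t stop many from using it in production.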

$ cd /tmp
$ wget https://github.com/prometheus/prometheus/releases/download/v2.2.0/prometheus-2.2.0.linux-amd64.tar.gz
$ tar -xzf prometheus-2.2.0.linux-amd64.tar.gz
 
$ sudo chmod +x prometheus-2.2.0.linux-amd64/{prometheus,promtool} 
$ sudo cp prometheus-2.2.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
$ sudo chown root:root /usr/local/bin/{prometheus,promtool}
 
$ sudo mkdir -p /etc/prometheus
$ sudo vim /etc/prometheus/prometheus.yml
$ promtool check config /etc/prometheus/prometheus.yml
 
Checking /etc/prometheus/prometheus.yml
  SUCCESS: 0 rule files found
 
$ prometheus --config.file "/etc/prometheus/prometheus.yml" &

Ansible installer

Paweł Krupa (@paulfantom, author of the Docker Workshop) and Roman Demachkovych (@rdemachkovych), together as Cloud Alchemy, gave a presentation about their Ansible role for Prometheus, with a demo site at https://demo.cloudalchemy.org.

  • Zero-configuration deployment
  • Easy management of multiple nodes
  • Error checking
  • Multiple CPU architecture support

  • versioning
  • system user management
  • CPU architecture auto-detection
  • systemd service files
  • Linux capabilities support
  • basic SELinux (Security-Enhanced Linux) security module support

https://travis-ci.org/cloudalchemy/demo-site

Starting Prometheus

To run Prometheus after downloading the Docker image from the “prom” account in Dockerhub:

docker run -p 9090:9090 -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

Start Docker and try again if you get this error message:

docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.

The expected message is:

msg="Server is ready to receive web requests."

The location of the database and the retention period are controlled by command-line options: add --storage.tsdb.path to use another path, and add --storage.tsdb.retention to specify a retention period other than the default 15d (days).

  1. Open a browser to see the metrics Prometheus exposes about itself at:

    http://localhost:9090/metrics

    There is no menu item to view the page.

  2. Open a browser to see the Graph at the URL home page:

    http://localhost:9090

    [Screenshot: the metric menu on the Graph page]

    The metrics in the example above are for the Go language runtime running locally.

    NOTE: https://prometheus.io/docs contains docs. It says in 2012 SoundCloud wrote Prometheus in Golang and open sourced it at https://github.com/prometheus.

Graphing specs

  1. Select “go_gc_duration_seconds” for the median, which is the 50th percentile (quantile 0.5), specified as:

    go_gc_duration_seconds{instance="localhost:9090",job="prometheus",quantile="0.5"}

    Also, for the per-second rate of samples appended to local storage, averaged over the last minute:

    rate(prometheus_tsdb_head_samples_appended_total[1m])

    See https://prometheus.io/docs/prometheus/latest/storage

  2. Press Execute.
  3. Click “Graph”.

    Notice the granularity of timing on the horizontal axis: thousandths of a second.
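
To make the rate() expression in step 1 more concrete, here is a minimal Python sketch of the idea: the per-second increase of a counter across a time window, compensating for counter resets. It is a simplification under stated assumptions. PromQL's rate() also extrapolates to the window boundaries, which this sketch omits, and the function name `simple_rate` and the sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Approximate per-second increase of a counter across the given samples.

    Illustration only: unlike PromQL's rate(), no extrapolation to the
    window boundaries is performed.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (for example a process restart): count from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical counter samples scraped every 15 seconds within one minute.
window = [(0.0, 100.0), (15.0, 130.0), (30.0, 160.0), (45.0, 190.0), (60.0, 220.0)]
print(simple_rate(window))  # 120 more samples in 60 seconds
```

The reset branch is why rate() should be applied to counters only: a gauge that legitimately decreases would be misread as a restart.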

    Configuring Prometheus.yml

  4. Open a browser to http://localhost:9090/config

    prometheus.yml is the configuration file that contains these blocks: global, rule_files, and scrape_configs. Optionally, there are remote_read, remote_write, alerting.

    global:
      evaluation_interval: 15s
      scrape_interval: 15s
      scrape_timeout: 10s

      external_labels:
        environment: localhost.localdomain
    

    In the global block, scrape_interval specifies how often Prometheus scrapes targets: every 15s (seconds) here. (The default is every 60 seconds.)

    The evaluation_interval of 15s (also 60 seconds by default) controls how often Prometheus evaluates rule files that specify creation of new time series and generation of alerts.

    What sets Prometheus apart is its rules engine, which generates alerts that are then handled by the Prometheus Alertmanager, installed separately.

    Recording rules precompute frequently needed or computationally expensive expressions and save their results as derived time series.

    Scrape configs

    This defines the job that scrapes Prometheus’s own metrics endpoint, plus a node job whose targets are read from a file:

    scrape_configs:
      - job_name: 'prometheus'
        metrics_path: "/metrics"
        static_configs:
        - targets: ['localhost:9090']
      - job_name: node
        file_sd_configs:
        - files:
          - "/etc/prometheus/file_sd/node.yml"
    

    There can be several jobs named in a config, named x, y, and z in the sample config file.

Local start

Alternately, run a downloaded Prometheus binary directly, without Docker:

PROTIP: Using /etc/prometheus would require sudo, but ~/.prometheus would not.

  1. Create a folder to hold the Prometheus configuration file, then CD to it:

    cd ~ ; mkdir .prometheus ; cd .prometheus
  2. Create a Prometheus configuration file in the folder or copy in then edit a full template example at:

    https://github.com/prometheus/prometheus/blob/release-2.3/config/testdata/conf.good.yml

  3. Validate YAML syntax with a linter such as yamllint:

    https://github.com/adrienverge/yamllint

  4. Validate for content using the promtool in the Prometheus bin folder:

    promtool check config prometheus.yml

    An example error message is:

    Checking prometheus.yml
      FAILED: parsing YAML file prometheus.yml: yaml: line 13: did not find expected '-' indicator
    

    The expected response is: “SUCCESS: 0 rule files found”.

  5. To run Prometheus locally in the directory containing the Prometheus binary:

    ./prometheus --config.file=prometheus.yml
    

    The log output begins with lines such as:

    level=info ts=2017-10-23T14:03:02.274562Z caller=main.go:216 msg="Starting prometheus"...

    Although an Alertmanager is not required to run Prometheus,…

Command

# Ansible managed file. Be wary of possible overwrites.
[Unit]
Description=Prometheus
After=network.target
 
[Service]
Type=simple
Environment="GOMAXPROCS=1"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention=30d \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=http://demo.cloudalchemy.org:9090
 
SyslogIdentifier=prometheus
Restart=always
 
[Install]
WantedBy=multi-user.target

App Metrics

The four golden signals of monitoring (latency, traffic, errors, and saturation) begin with:

  • Latency

    The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

  • Traffic

    A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.

To identify bottlenecks, instead of beginning with given metrics (partial answers) and trying to work backwards, the Utilization Saturation and Errors (USE) Method by Brendan Gregg (of Netflix), described at http://www.brendangregg.com/usemethod.html, begins by posing questions off a checklist, and then seeks answers. The method directs the construction of a checklist, which for server analysis can be used to quickly identify resource bottlenecks or errors.

  • Utilization

    The average time that the resource was busy servicing work.

  • Errors

    The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.

  • Saturation

    How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential. In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., “Give me a nonce” or “I need a globally unique monotonic integer”) that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.

Predictive: saturation is the basis for projections of impending issues, such as “at the current rate, your database will fill its hard drive in 4 hours.”
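
A projection like the one above can be sketched by fitting a least-squares line to recent samples, which is the idea behind PromQL's predict_linear() function. The Python below is a hedged illustration rather than Prometheus code; the function name `seconds_until_full`, the disk capacity, and the usage samples are all hypothetical.

```python
from typing import List, Optional, Tuple


def seconds_until_full(samples: List[Tuple[float, float]], capacity: float) -> Optional[float]:
    """Fit a least-squares line to (timestamp, amount_used) samples and return
    the seconds from the last sample until the line reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    count = len(samples)
    if count < 2:
        raise ValueError("need at least two samples to fit a line")
    mean_t = sum(t for t, _ in samples) / count
    mean_v = sum(v for _, v in samples) / count
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance  # growth in units per second
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    time_when_full = (capacity - intercept) / slope
    return time_when_full - samples[-1][0]


# Hypothetical disk usage in GB, sampled hourly, on a 102 GB volume.
usage = [(0.0, 96.0), (3600.0, 97.0), (7200.0, 98.0)]
remaining = seconds_until_full(usage, capacity=102.0)
print(round(remaining / 3600, 2), "hours until full")
```

With growth of 1 GB per hour and 4 GB of headroom after the last sample, the estimate comes out at about four hours.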

Exporters

Prometheus maintains exporters for well-known services: StatsD, Node, AWS CloudWatch, InfluxDB, JMX, SNMP, HAProxy, Consul, Memcached, Graphite, Blackbox, etc. See https://prometheus.io/docs/instrumenting/exporters

The WMI Exporter provides system metrics for Windows servers.

Custom exporters are in the category of: database, messaging systems, APIs, logging, storage, hardware related, HTTP, etc.

Ports:

Node Exporter

The Prometheus Node Exporter has its own repo at https://github.com/prometheus/node_exporter

To download a release from GitHub:

https://github.com/prometheus/node_exporter/releases

# TODO: Identify latest version URL in https://prometheus.io/download/#node_exporter
# TODO: Code different downloads for Darwin vs. other OS:
wget https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz
   # https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.darwin-386.tar.gz
   # v0.16.0 is dated 2018-05-15
tar -xzf node_exporter-*
sudo cp node_exporter-*/node_exporter /usr/local/bin/
node_exporter --version

A sample response (at time of writing):

node_exporter, version 0.16.0 (branch: HEAD, revision: 6e2053c557f96efb63aef3691f15335a70baaffd)
. . .

The node_exporter exporter runs, by default, on port 9100 to expose metrics, but can be changed:

node_exporter --web.listen-address=":9100" \
   --web.telemetry-path="/node_metrics"

And:

scrape_configs:
  - job_name: "prometheus"
    metrics_path: "/metrics"
    static_configs:
    - targets:
      - "localhost:9090"
  - job_name: node
    file_sd_configs:
    - files:
      - "/etc/prometheus/file_sd/node.yml"
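
The file named under file_sd_configs lists targets outside the main configuration file. As a hedged sketch, the Python below renders such a target list in the JSON variant of the file format: a list of groups, each with `targets` and optional `labels`. The host names, the `env` label, and the helper name `render_file_sd` are hypothetical, and the files entry in the config above would need to name a .json file for this output to be read as JSON.

```python
import json
from typing import Dict, List


def render_file_sd(targets: List[str], labels: Dict[str, str]) -> str:
    """Return one file_sd target group, serialized as JSON text."""
    groups = [{"targets": sorted(targets), "labels": labels}]
    return json.dumps(groups, indent=2)


# Hypothetical Node Exporter hosts listening on the default port 9100.
text = render_file_sd(
    ["web2.example.com:9100", "web1.example.com:9100"],
    {"env": "demo"},
)
print(text)
```

Generating the file from an inventory keeps the main prometheus.yml unchanged as hosts come and go.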
   

Metrics exposition format

# HELP http_request_duration_microseconds The HTTP request latencies in microseconds.
# TYPE http_request_duration_microseconds summary
http_request_duration_microseconds{handler="prometheus",quantile="0.5"} 73334.095
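
Each sample line in this text format is a metric name, an optional set of labels in braces, and a value. A rough Python sketch of reading one such line follows. The regular expressions cover only the simple case shown above (no escaped quotes inside label values, no trailing timestamp), and the function name `parse_sample` is hypothetical.

```python
import re
from typing import Dict, Tuple

# Metric name, optional {labels}, whitespace, then the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split one exposition-format sample line into name, labels, and value."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


line = 'http_request_duration_microseconds{handler="prometheus",quantile="0.5"} 73334.095'
name, labels, value = parse_sample(line)
print(name, labels, value)
```

Lines starting with `# HELP` or `# TYPE` are metadata rather than samples, so this sketch rejects them with a ValueError.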
   

Operator

TBD

Alert Manager

The Alert Manager uses port 9093 by default.

The Prometheus Alertmanager handles the alerts that Prometheus servers generate and send to it.

A sample config:

alerting:
  alertmanagers:
  - scheme: https
    static_configs:
    - targets:
      - "1.2.3.4:9093"
      - "1.2.3.5:9093"
      - "1.2.3.6:9093"
   
  • routing
  • sending
  • grouping
  • deduplication

Functions:

  • silencing
  • inhibition
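
Grouping and deduplication can be pictured as follows: alerts with identical label sets collapse into one, and the remainder are bucketed by the values of chosen labels so that one notification covers each bucket. This is a conceptual Python sketch, not Alertmanager's implementation. The alert labels and the choice to group by `alertname` and `cluster` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # an alert is represented here only by its label set


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Drop alerts with identical label sets, then bucket by the group_by labels."""
    seen = set()
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(sorted(alert.items()))
        if fingerprint in seen:
            continue  # duplicate of an alert already received
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


incoming = [
    {"alertname": "InstanceDown", "cluster": "eu", "instance": "a:9100"},
    {"alertname": "InstanceDown", "cluster": "eu", "instance": "b:9100"},
    {"alertname": "InstanceDown", "cluster": "eu", "instance": "a:9100"},  # duplicate
    {"alertname": "DiskFull", "cluster": "us", "instance": "c:9100"},
]
grouped = group_alerts(incoming, ["alertname", "cluster"])
for key, members in grouped.items():
    print(key, len(members))
```

Four incoming alerts become two notifications here: one for the two distinct InstanceDown alerts in the eu cluster and one for DiskFull in us.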

Under development: A cluster of Alertmanager instances forms a mesh configuration to ensure high availability.

Integrations include:

  • email
  • hipchat
  • pagerduty
  • pushover
  • slack
  • opsgenie
  • webhook
  • victorops

PromQL Query Language

PromQL (Prometheus Query Language) is what API clients such as Grafana use to extract time-series data from Prometheus for their visualizations.
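
API clients retrieve data by sending a PromQL expression to the server's HTTP API. As a hedged sketch, the Python below builds an instant-query URL for the /api/v1/query endpoint and, when a server is reachable, reads the JSON reply. The base URL http://localhost:9090 matches the local setup described earlier; the helper names `build_query_url` and `run_query` are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def run_query(base_url: str, promql: str) -> dict:
    """Send the query and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=5) as response:
        return json.load(response)


url = build_query_url(
    "http://localhost:9090",
    "rate(prometheus_tsdb_head_samples_appended_total[1m])",
)
print(url)
# With Prometheus running locally, run_query("http://localhost:9090", "up")
# returns a dictionary decoded from the JSON response.
```

URL-encoding matters because PromQL expressions routinely contain parentheses, brackets, braces, and quotes.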

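Core metric types offered by Prometheus client libraries:

  1. Counter - a cumulative value that only increases, or resets to zero on restart (such as packets received)
  2. Gauge - a current value that increases or decreases (such as memory usage)
  3. Histogram - samples observations (such as request durations) and counts them in configurable buckets, along with a sum and count of all observed values
  4. Summary - samples observations like a histogram, but calculates configurable quantiles over a sliding time window, along with a sum and count

For example, to estimate the 90th percentile of request durations from a histogram over the last 5 minutes:

histogram_quantile(
  0.90,
  sum without(code,instance)(
   rate(http_request_seconds_bucket[5m])
))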
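
The histogram_quantile() query above estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the requested rank falls. The Python below is a simplified sketch of that idea, assuming non-negative bucket bounds and a quantile in the range (0, 1]; the bucket values are hypothetical and edge-case handling in Prometheus itself is more thorough.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The highest bucket must have an upper bound of +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch supports 0 < q <= 1 only")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Rank falls beyond the last finite bound: report that bound.
        return buckets[-2][0]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative counts of request durations in seconds.
observed = [(0.1, 20.0), (0.5, 80.0), (1.0, 100.0), (math.inf, 100.0)]
print(histogram_quantile(0.9, observed))  # approximately 0.75
```

Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the values of interest.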

Client libraries

Embed official client libraries:

Unofficial third-party client libraries:
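
To illustrate what a client library does on behalf of an application, the sketch below keeps a counter in memory and renders it in the text exposition format that Prometheus scrapes. It uses only the Python standard library and is not a substitute for an official client library; the class name `Counter` and the metric name `todo_items_created_total` are hypothetical.

```python
import threading
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


class Counter:
    """A monotonically increasing metric with optional labels."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values: Dict[LabelSet, float] = {}
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        """Add a non-negative amount to the series identified by the labels."""
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(sorted(labels.items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        """Return the HELP, TYPE, and sample lines for this metric."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for key, value in sorted(self._values.items()):
                label_text = ",".join(f'{k}="{v}"' for k, v in key)
                suffix = "{" + label_text + "}" if label_text else ""
                lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


created = Counter("todo_items_created_total", "To-do items created.")
created.inc(status="ok")
created.inc(status="ok")
created.inc(status="error")
print(created.render(), end="")
```

An application would serve the rendered text over HTTP, conventionally at /metrics, so that Prometheus can scrape it.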

Video courses

If you have a subscription to OReilly.com, Sander van Vugt has a video course on Kubernetes and Cloud Native Associate (KCNA) published by Pearson IT Certification. He also has a live course 6-10am MT Dec 1 & 2, 2022.

Resources

“Monitoring with Prometheus” is 360 pages at https://prometheusbook.com is by James Turnbull, who also wrote books about other DevOps tools: Kubernetes, Packer, Terraform, Logstash, Puppet, etc. based on his work as CTO at Kickstarter, VP of Services and Support at Docker, VP of Engineering at Venmo, and VP of Technical Operations at Puppet. The book is hands-on for Prometheus version 2.3.0 (build date 20171006-22:16:15) on a Linux distribution. However, the author promises updates even though he is busy as CTO at Empatico. Code for the book is at:

Turnbull suggests monitoring for “correctness”, not just their status, starting with business metrics, then application (https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html#xref_monitoring_golden-signals), then operating system metrics to avoid “cargo cult” delusions. An example is monitoring for rates of business transactions rather than server uptime.

Brian Brazil blogs about Prometheus at https://www.robustperception.io/blog/ The blog mentions his trainings. He is working on a Safari book, “Prometheus: Up & Running”.

paulfantom/workshop-docker

Monitoring, the Prometheus Way May 8, 2017 by Julius Volz - Co-Founder, Prometheus

Infrastructure and application monitoring using Prometheus at Devoxx UK May 17, 2017 by Marco Pas

LinuxAcademy video hands-on courses:

  • Monitoring Infrastructure and Containers with Prometheus: Prometheus is used to monitor infrastructure and applications at multiple levels: on the host itself, on any containers, and on the application. This hands-on lab addresses monitoring of virtual machine host and containers. It begins by setting up monitoring for a virtual machine using Prometheus’s Node Exporter. Then set up container monitoring for the provided container using Google’s cAdvisor.

    View metrics in Prometheus across two levels of a system to track changes and view trends.

  • DevOps Monitoring Deep Dive by Elle Krout references an interactive Lucid diagram called “ProjectForethought” for the NodeJs simple to-do list program called Forethought that is the subject of monitoring.


Other notes

https://timber.io/blog/prometheus-the-good-the-bad-and-the-ugly/

https://eng.uber.com/m3/ Uber open-sourced their M3 Metrics platform for Prometheus in 2018