Collect metrics (for visualization by Grafana), analyze using PromQL coding, and identify alerts, especially for Kubernetes (also from CNCF).


Overview

This is a deep dive into getting started using Prometheus in enterprise production.

How it works

From the PowerPoint file animations used to create this VIDEO and diagram:
prometheus-flow-240113-1739x838.png

  1. The core component of Prometheus is a server service written in Golang, so there is no additional runtime (VM) to install, as there is with Java, Python, etc.

  2. Installers for Prometheus have been created in Homebrew for macOS, Ubuntu, and Windows. Commands to invoke them, along with sample configuration files, are in a sample project GitHub repo.

  3. Prometheus is configured by editing the prometheus.yml file. One comes with the installer, but you should keep a customized file in each app's GitHub repo. Then start the server.

  4. The Prometheus server sends HTTP(S) GET requests to scrape (pull) metrics from target hosts defined in its targets.json file.

  5. In addition to statically-defined targets, targets can be discovered by Service Discovery such as DNS, Kubernetes Services, or HashiCorp Consul services. The frequency of scraping and other settings are defined in the prometheus.yml file.

  6. Each target exposes metrics to Prometheus through an exporter service installed on each host: the WMI exporter on Windows, the Node Exporter on Linux, etc.

  7. Short-lived (batch) jobs that may finish before the next scrape can instead push their metrics to an intermediary Pushgateway, which Prometheus then scrapes (a sample push command is sketched below).
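
    Assuming a Pushgateway is running on its default port 9091, pushing one metric looks roughly like this (the metric and job names here are hypothetical):

    echo "my_batch_job_last_success_timestamp_seconds $(date +%s)" | \
      curl --data-binary @- http://localhost:9091/metrics/job/my_batch_job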

  8. Exporters reference custom metric providers which expose specific metrics.

  9. Unlike the legacy statsd daemon, which is concerned only with system-level metrics such as CPU, memory, etc., Prometheus (at https://prometheus.io) gathers metrics from targets at the cluster, node, and microservice API levels.

    NOTE: A single Prometheus server can handle up to 1,000 scrape targets, at 100,000+ samples per second. But for larger deployments, multiple Prometheus servers can be deployed in a federated architecture, with a root Prometheus server scraping data from the child servers.
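
    A minimal sketch of the root server's federation scrape config, assuming hypothetical child servers named child-prom-1 and child-prom-2:

    scrape_configs:
      - job_name: 'federate'
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job=~".+"}'    # pull every job's series from each child
        static_configs:
          - targets:
            - 'child-prom-1:9090'
            - 'child-prom-2:9090'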

  10. Prometheus stores scraped samples locally in its own multi-dimensional numeric time-series database (TSDB).

  11. Recording rules defined in Prometheus can filter and aggregate existing data into new time series stored in the TSDB.

    Each Prometheus server runs standalone, not dependent on network storage or other remote services, so it remains available even when other parts of the infrastructure are broken.

    PROTIP: Data on Prometheus servers should be considered short-lived (temporary) because data on it can be lost if the server is restarted.

  12. Prometheus exposes its time-series data to a variety of API clients making PromQL (Prometheus Query Language) statements which extract data.

  13. Many enterprises have API clients go through an API Gateway to enforce strong authentication and traffic limits from specific users.

  14. Data on Prometheus should be frequently sent to a long-term storage system such as AWS S3, InfluxDB, Elasticsearch, Microsoft Fabric, etc.

    PROTIP: Sending operational data to a central enterprise repository would enable a central SOC (Security Operations Center) to correlate events from throughout the enterprise and outside the enterprise. This approach does require diligence at managing disk space and retention.

  15. PROTIP: To minimize training and confusion, enterprise organizations typically have a preferred set of tools for analytics processing to generate graphs and dashboards for visualization.

  16. Grafana, from Grafana Labs, is commonly paired with Prometheus to present dashboards that visualize its data.

    NOTE: Grafana can also be used to visualize data from other sources such as InfluxDB, Elasticsearch, and Loki.

  17. PROTIP: When using S3, by default, Prometheus references a static file of long-lived credentials for authentication. To prevent compromise, many organizations use a secrets management utility such as HashiCorp Vault, which dynamically creates a new set of S3 credentials each time before running the backup.

  18. PROTIP: Cloud storage mechanisms have a backup mechanism to restore data in case of failure. Practice restoring data to a new server to ensure that the backup mechanism actually works.

  19. NOTE: The Prometheus server can be configured to read data from remote sources – perform remote read.

  20. Because people can't always be watching dashboard screens, Alerting Rules are set to trigger alerts pushed to the Alertmanager, which forwards them to various endpoints such as email, Slack, PagerDuty, SMS, OpsGenie, or other notification mechanisms.

    NOTE: In a HA configuration, alerts are sent to multiple Alert Managers (with different external labels -a and -b), which deduplicate and fan out alerts to their configured receivers.


NOTE: Prometheus does not collect event data from operating systems or logs emitted by applications.

The Status menu of the Prometheus web UI provides these pages for each Prometheus server:

  • /status = Run-time and build information
  • /flags = Command-line flags and their values
  • /config = The currently loaded configuration
  • /rules = Rule evaluation states

  • /targets = Scrape status information for all active targets
  • /service-discovery = Service discovery status information
  • /tsdb-status = TSDB status information

Competitors

Prometheus was born in the cloud age, which can accommodate large databases. Prometheus stores data in a standalone time-series database that passively stores metrics.

Prometheus differs from previous fault-detection systems, like Nagios, which run periodic check scripts but keep little historical data. Prometheus saves historical data in a Time Series Database (TSDB).

NOTE: Comparing Prometheus with the "TIG" stack (Telegraf + InfluxDB + Grafana):

  • Data Collection: Prometheus is a pull-based system where it collects data by pulling metrics from targets. On the other hand, Telegraf supports both pull and push mechanisms. It can collect data by pulling metrics from various sources as well as pushing metrics to different destinations.

  • Scalability: Prometheus is designed to be highly scalable and can handle large amounts of data. It achieves scalability through a federated architecture where multiple Prometheus servers can be federated together. Telegraf, on the other hand, is a lightweight collector that can be deployed on a large number of machines, making it highly scalable as well.

  • Data Processing: Prometheus comes with its own query language called PromQL, which allows advanced querying and processing of metrics. With PromQL, users can perform aggregations, filtering, and math operations on the collected data. Telegraf, on the other hand, focuses more on data collection and routing, leaving the data processing task to other tools in the stack.

  • Plugin Ecosystem: Telegraf has a rich plugin ecosystem, which allows easy integration with various systems and technologies. It supports a wide range of input and output plugins, making it flexible and extensible. Prometheus, on the other hand, focuses more on the core monitoring and alerting functions and has a limited number of official plugins.

  • Alerting: Prometheus has a built-in alerting system that allows users to define alert rules based on the collected metrics. It supports various notification channels, such as email, Slack, and PagerDuty, to send alerts when certain conditions are met. Telegraf, on the other hand, does not have a built-in alerting system and relies on other tools in the monitoring stack for alerting functionality.

There are also paid solutions such as Elasticsearch, Datadog, and other SIEM offerings.

What is “Observability”?


Prometheus Origins

“Prometheus” is, in Ancient Greek mythology, a Titan of forethought, fire, and crafty counsel – an immortal servant of the gods. He stole fire and gave it to humankind. This changed the human race forever (for better and worse). But this made mankind dangerous to the gods.

Ridley Scott named his 2012 film “Prometheus”, saying: “It’s the story of creation; the gods and the man who stood against them.”

VIDEO: https://prometheus.io/docs/introduction/overview/
The software named Prometheus began at SoundCloud in 2012, where ex-Google SREs (Site Reliability Engineers) adapted ideas from Google's internal Borgmon system.

prometheus-borgmon-1856x1136.png

Prometheus was open-sourced in 2015 at https://github.com/prometheus/prometheus/releases

Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016 as its second hosted project after Kubernetes. So as would be expected, Prometheus works with K8s.


PCA Exam

CNCF is under the Linux Foundation, which offers the $250 Prometheus Certified Associate (PCA) exam. Candidates (with one retake allowed) have 90 minutes to correctly answer at least 75% of 60 questions covering these domains:

18% Observability Concepts

  • Metrics
  • Understand logs and events
  • Tracing and Spans
  • Push vs Pull
  • Service Discovery
  • Basics of SLOs, SLAs, and SLIs

20% Prometheus Fundamentals

  • System Architecture
  • Configuration and Scraping
  • Understanding Prometheus Limitations
  • Data Model and Labels
  • Exposition Format

28% PromQL

  • Selecting Data
  • Rates and Derivatives
  • Aggregating over time
  • Aggregating over dimensions
  • Binary operators
  • Histograms
  • Timestamp Metrics

16% Instrumentation and Exporters

  • Client Libraries
  • Instrumentation
  • Exporters
  • Structuring and naming metrics

18% Alerting & Dashboarding

References:

  • https://docs.linuxfoundation.org/tc-docs/certification/important-instructions-pca
  • https://docs.linuxfoundation.org/tc-docs/certification/frequently-asked-questions-pca
  • https://trainingportal.linuxfoundation.org/learn/course/prometheus-certified-associate-pca/exam/exam
  • https://medium.com/@onai.rotich/prometheus-certified-associate-a-comprehensive-guide-9c51638578d2

The Candidate Handbook references the PSI BRIDGE proctoring platform.


Courses

Linux Foundation developed a 2018 course.

VIDEO: Prometheus co-founder Julius Volz (in Berlin, Germany) authored the Linux Foundation's 2018 course, which is now dated. So he created promlabs.com to offer $349 video training on Prometheus monitoring fundamentals, with tutorials, quizzes, tips, and best practices. It's based on Ubuntu Linux 20.04. Julius also built the PromLens query builder.

KodeKloud

https://kodekloud.com/courses/prometheus-certified-associate-pca KodeKloud PCA Certification Course

OReilly.com

If you have a subscription to OReilly.com:

By Bipin:

By Sean Bradley:

By Alex Philip:

  • Install on macOS (see below). His article Setup Prometheus stack using Docker Compose within AWS (December 2, 2023) references folder 04-prometheus-observability-stack within
    https://github.com/techiescamp/devops-projects

    .
    ├── CODE_OF_CONDUCT.md
    ├── CONTRIBUTING.md
    ├── LICENSE
    ├── Makefile
    ├── README.md
    ├── SECURITY.md
    ├── alertmanager
    │   └── alertmanager.yml
    ├── docker-compose.yml
    ├── prometheus
    │   ├── alertrules.yml
    │   ├── prometheus.yml
    │   └── targets.json
    └── terraform-aws
      ├── README.md
      ├── modules
      │   ├── ec2
      │   │   ├── main.tf
      │   │   ├── outputs.tf
      │   │   ├── user-data.sh
      │   │   └── variables.tf
      │   └── security-group
      │       ├── main.tf
      │       ├── outputs.tf
      │       └── variables.tf
      ├── prometheus-stack
      │   ├── main.tf
      │   ├── outputs.tf
      │   └── variables.tf
      └── vars
          └── ec2.tfvars
     

By Edward Viaene and Jorn Jambers:

Pluralsight

On Pluralsight.com, take their Skill IQ assessment after their 9-hour Event Monitoring and Alerting with Prometheus path of 4 courses.

By Elton Stoneman (sixeyed.com):

docker run -d -p 8080:80 --name web sixeyed/prometheus-demo-web:windows

By Chris Green (direct-root.com):

By Craig Golightly (seethatgo.com):

By Marcos Souza:

  • https://github.com/ACloudGuru-Resources/DevOps-Monitoring-Deep-Dive

By Elle Krout:

Others:

walidshaari/PrometheusCertifiedAssociate

edgarpf/prometheus-certified-associate

Al-HusseinHameedJasim/prometheus-certified-associate

Sample app

The $299 course “Monitoring Infrastructure and Containers with Prometheus” (LFS241) is based on the PaC (Project Forethought) application, which is a simple to-do list program written in Node.js. It is Dockerized and deployed to a virtual machine. The application is instrumented with Prometheus client libraries to track metrics across the app.

  1. Course Introduction
  2. Introduction to Systems and Service Monitoring
  3. Introduction to Prometheus

  4. Installing and Setting Up Prometheus

  5. Basic Querying
  6. Dashboarding
  7. Monitoring Host Metrics
  8. Monitoring Container Metrics
  9. Instrumenting Code
  10. Building Exporters
  11. Advanced Querying
  12. Relabeling
  13. Service Discovery
  14. Blackbox Monitoring
  15. Pushing Data
  16. Alerting
  17. Making Prometheus Highly Available
  18. Recording Rules
  19. Scaling Prometheus Deployments
  20. Prometheus and Kubernetes
  21. Local Storage
  22. Remote Storage Integrations
  23. Transitioning From and Integrating with Other Monitoring Systems
  24. Monitoring and Debugging Prometheus

  25. Within Linux Academy's cloud Servers, create the "DevOps Monitoring Deep Dive" distribution on a small-sized host. https://github.com/linuxacademy/content-devops-monitoring-app

  26. When “READY”, click the Distribution name “DevOps Monitoring Deep Dive” for details.
  27. Highlight and copy the Temp. Password by clicking the copy icon.
  28. Click “Terminal” to open another browser window.
  29. Type “cloud_user” to login:
  30. Paste the password.
  31. For a new password, I paste the password again, but add an additional character.
  32. Again to confirm.

  33. When an environment is opened, highlight and copy this command:

    bash -c "$(curl -fsSL https://raw.githubusercontent.com/wilsonmar/DevSecOps/master/Prometheus/prometheus-setup.sh)"
  34. Copy the password to your computer’s Clipboard.
  35. Switch to the Terminal to paste, which runs the script.
  36. Paste the password when prompted.

  37. To rerun the script, discard the current instance and create a new instance.

    The script is self-documented, but below are additional comments:

blackbox_exporter

This exporter actively probes target service endpoints from the outside to get Prometheus metrics.

VERV=$(curl --silent -qI https://github.com/prometheus/blackbox_exporter/releases/latest | awk -F '/' '/^location/ {print substr($NF, 1, length($NF)-1)}')
echo $VERV   # v0.24.0
VER=${VERV:1}
echo $VER
wget "https://github.com/prometheus/blackbox_exporter/releases/download/$VERV/blackbox_exporter-$VER.linux-amd64.tar.gz"

Unpack it.

tar xvfz "blackbox_exporter-$VER.linux-amd64.tar.gz"

ls blackbox_exporter-0.24.0.linux-amd64.tar.gz
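
Once unpacked, run the exporter and try a probe by hand. A minimal sketch (it listens on port 9115 by default; the target URL and the http_2xx module from the bundled blackbox.yml are just examples):

cd "blackbox_exporter-$VER.linux-amd64"
./blackbox_exporter &
curl "http://localhost:9115/probe?target=https://example.com&module=http_2xx"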


Homebrew on macOS install

  1. Install Homebrew using my directions for configuring PATH, etc.
  2. Open a Terminal, at any folder, see information about the formula:

    brew info prometheus

    The response at time of writing:

    ==> prometheus: stable 2.49.1 (bottled)
    Service monitoring system and time series database
    https://prometheus.io/
    Not installed
    From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/p/prometheus.rb
    License: Apache-2.0
    ==> Dependencies
    Build: gnu-tar ✘, go ✘, node ✔, yarn ✘
    ==> Caveats
    When run from `brew services`, `prometheus` is run from
    `prometheus_brew_services` and uses the flags in:
    /usr/local/etc/prometheus.args
     
    To start prometheus now and restart at login:
      brew services start prometheus
    Or, if you don't want/need a background service you can just run:
      /usr/local/opt/prometheus/bin/prometheus_brew_services
    ==> Analytics
    install: 1,224 (30 days), 3,645 (90 days), 12,333 (365 days)
    install-on-request: 1,224 (30 days), 3,645 (90 days), 12,332 (365 days)
    build-error: 4 (30 days)
    
  3. Install using the Homebrew formula:

    brew install prometheus

    The response at time of writing:

    ==> Downloading https://ghcr.io/v2/homebrew/core/prometheus/manifests/2.49.1
    ######################################################################################################################################### 100.0%
    ==> Fetching prometheus
    ==> Downloading https://ghcr.io/v2/homebrew/core/prometheus/blobs/sha256:dee352f2deef862dff57a254049859095fc6e682fe702cc8f190c6aedf8543b7
    ######################################################################################################################################### 100.0%
    ==> Pouring prometheus--2.49.1.ventura.bottle.tar.gz
    ==> Caveats
    When run from `brew services`, `prometheus` is run from
    `prometheus_brew_services` and uses the flags in:
    /usr/local/etc/prometheus.args
     
    To start prometheus now and restart at login:
      brew services start prometheus
    Or, if you don't want/need a background service you can just run:
      /usr/local/opt/prometheus/bin/prometheus_brew_services
    ==> Summary
    🍺  /usr/local/Cellar/prometheus/2.49.1: 22 files, 235.2MB
    ==> Running `brew cleanup prometheus`...
    Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
    Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).
    
  4. Where was it installed?

    which prometheus

    On an Intel chip (x86_64/amd64 architecture):

    /usr/local/bin/prometheus

    On an Apple Silicon M1/M2/M3 ARM architecture chip (where Homebrew's prefix is /opt/homebrew):

    /opt/homebrew/bin/prometheus

    BTW: ./prometheus (with the ./) is run when you are in a folder containing the executable. But that's not necessary because /usr/local/bin (or /opt/homebrew/bin) is typically in the operating system $PATH environment variable.

  5. Verify that the executable can be reached from any folder:

    prometheus --version

    Response at time of writing on an AMD machine:

    prometheus, version 2.49.1 (branch: non-git, revision: non-git)
      build user:       brew@Ventura
      build date:       20240115-16:56:27
      go version:       go1.21.6
      platform:         darwin/amd64
      tags:             netgo,builtinassets,stringlabels
    
  6. List all the parameters (drag the terminal window's right edge wider to see the full lines):

    prometheus --help

    Response at time of writing on an AMD machine:

    usage: prometheus [<flags>]
    The Prometheus monitoring server
     
    Flags:
      -h, --[no-]help                Show context-sensitive help (also try --help-long and --help-man).
       --[no-]version             Show application version.
       --config.file="prometheus.yml"  
                                  Prometheus configuration file path.
       --web.listen-address="0.0.0.0:9090"  
                                  Address to listen on for UI, API, and telemetry.
       --web.config.file=""       [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication.
       --web.read-timeout=5m      Maximum duration before timing out read of the request, and closing idle connections.
       --web.max-connections=512  Maximum number of simultaneous connections.
       --web.external-url=URL   The URL under which Prometheus is externally reachable (for example, if Prometheus is served via a reverse
                                  proxy). Used for generating relative and absolute links back to Prometheus itself. If the URL has a path
                                  portion, it will be used to prefix all HTTP endpoints served by Prometheus. If omitted, relevant URL components
                                  will be derived automatically.
       --web.route-prefix=path  Prefix for the internal routes of web endpoints. Defaults to path of --web.external-url.
       --web.user-assets=path   Path to static asset directory, available at /user.
       --[no-]web.enable-lifecycle  
                                  Enable shutdown and reload via HTTP request.
       --[no-]web.enable-admin-api  
                                  Enable API endpoints for admin control actions.
       --[no-]web.enable-remote-write-receiver  
                                  Enable API endpoint accepting remote write requests.
       --web.console.templates="consoles"  
                                  Path to the console template directory, available at /consoles.
       --web.console.libraries="console_libraries"  
                                  Path to the console library directory.
       --web.page-title="Prometheus Time Series Collection and Processing Server"  
                                  Document title of Prometheus instance.
       --web.cors.origin=".*"     Regex for CORS origin. It is fully anchored. Example: 'https?://(domain1|domain2)\.com'
       --storage.tsdb.path="data/"  
                                  Base path for metrics storage. Use with server mode only.
       --storage.tsdb.retention=STORAGE.TSDB.RETENTION  
                                  [DEPRECATED] How long to retain samples in storage. This flag has been deprecated, use
                                  "storage.tsdb.retention.time" instead. Use with server mode only.
       --storage.tsdb.retention.time=STORAGE.TSDB.RETENTION.TIME  
                                  How long to retain samples in storage. When this flag is set it overrides "storage.tsdb.retention". If neither
                                  this flag nor "storage.tsdb.retention" nor "storage.tsdb.retention.size" is set, the retention time defaults to
                                  15d. Units Supported: y, w, d, h, m, s, ms. Use with server mode only.
       --storage.tsdb.retention.size=STORAGE.TSDB.RETENTION.SIZE  
                                  Maximum number of bytes that can be stored for blocks. A unit is required, supported units: B, KB, MB, GB, TB,
                                  PB, EB. Ex: "512MB". Based on powers-of-2, so 1KB is 1024B. Use with server mode only.
       --[no-]storage.tsdb.no-lockfile  
                                  Do not create lockfile in data directory. Use with server mode only.
       --storage.tsdb.head-chunks-write-queue-size=0  
                                  Size of the queue through which head chunks are written to the disk to be m-mapped, 0 disables the queue
                                  completely. Experimental. Use with server mode only.
       --storage.agent.path="data-agent/"  
                                  Base path for metrics storage. Use with agent mode only.
       --[no-]storage.agent.wal-compression  
                                  Compress the agent WAL. Use with agent mode only.
       --storage.agent.retention.min-time=STORAGE.AGENT.RETENTION.MIN-TIME  
                                  Minimum age samples may be before being considered for deletion when the WAL is truncated Use with agent mode
                                  only.
       --storage.agent.retention.max-time=STORAGE.AGENT.RETENTION.MAX-TIME  
                                  Maximum age samples may be before being forcibly deleted when the WAL is truncated Use with agent mode only.
       --[no-]storage.agent.no-lockfile  
                                  Do not create lockfile in data directory. Use with agent mode only.
       --storage.remote.flush-deadline=duration  
                                  How long to wait flushing sample on shutdown or config reload.
       --storage.remote.read-sample-limit=5e7  
                                  Maximum overall number of samples to return via the remote read interface, in a single query. 0 means no limit.
                                  This limit is ignored for streamed response types. Use with server mode only.
       --storage.remote.read-concurrent-limit=10  
                                  Maximum number of concurrent remote read calls. 0 means no limit. Use with server mode only.
       --storage.remote.read-max-bytes-in-frame=1048576  
                                  Maximum number of bytes in a single frame for streaming remote read response types before marshalling. Note
                                  that client might have limit on frame size as well. 1MB as recommended by protobuf by default. Use with server
                                  mode only.
       --rules.alert.for-outage-tolerance=1h  
                                  Max time to tolerate prometheus outage for restoring "for" state of alert. Use with server mode only.
       --rules.alert.for-grace-period=10m  
                                  Minimum duration between alert and restored "for" state. This is maintained only for alerts with configured
                                  "for" time greater than grace period. Use with server mode only.
       --rules.alert.resend-delay=1m  
                                  Minimum amount of time to wait before resending an alert to Alertmanager. Use with server mode only.
       --alertmanager.notification-queue-capacity=10000  
                                  The capacity of the queue for pending Alertmanager notifications. Use with server mode only.
       --query.lookback-delta=5m  The maximum lookback duration for retrieving metrics during expression evaluations and federation. Use with
                                  server mode only.
       --query.timeout=2m         Maximum time a query may take before being aborted. Use with server mode only.
       --query.max-concurrency=20  
                                  Maximum number of queries executed concurrently. Use with server mode only.
       --query.max-samples=50000000  
                                  Maximum number of samples a single query can load into memory. Note that queries will fail if they try to load
                                  more samples than this into memory, so this also limits the number of samples a query can return. Use with
                                  server mode only.
       --enable-feature= ...      Comma-separated feature names to enable. Valid options: agent, exemplar-storage, expand-external-labels,
                                  memory-snapshot-on-shutdown, promql-at-modifier, promql-negative-offset, promql-per-step-stats,
                                  promql-experimental-functions, remote-write-receiver (DEPRECATED), extra-scrape-metrics,
                                  new-service-discovery-manager, auto-gomaxprocs, no-default-scrape-port, native-histograms, otlp-write-receiver.
                                  See https://prometheus.io/docs/prometheus/latest/feature_flags/ for more details.
       --log.level=info           Only log messages with the given severity or above. One of: [debug, info, warn, error]
       --log.format=logfmt        Output format of log messages. One of: [logfmt, json]
    

    PROTIP: An investment of time to get to know what parameters are available can come in handy (less stress for you) during troubleshooting.

    --enable-feature=exemplar-storage enables (currently experimental) exposition-format exemplars.

    QUESTION: server vs agent mode,

Clone Custom Project

References:

  • https://prometheus.io/docs/prometheus/latest/configuration/configuration/
  • Good example: https://github.com/prometheus/prometheus/blob/release-2.49/config/testdata/conf.good.yml

  1. Setup VSCode or whatever IDE you prefer to use.
  2. Setup Git global configurations, including SSH and GPG for a verified connection to GitHub.com
  3. Create a folder where you can clone a repo into.
  4. Clone the repo from the article Setup Prometheus stack using Docker Compose within AWS (December 2, 2023), which references folder 04-prometheus-observability-stack:

    git clone git@github.com:techiescamp/devops-projects.git --depth 1
    
  5. Remove other DevOps project folders not relevant: ???

  6. Set upstream to the original developer (which contains several DevOps projects):

    git remote add upstream https://github.com/wilsonmar/devops-projects 
    
  7. Navigate into the project folder:

    cd ~/github-wilsonmar/devops-projects/04-prometheus-observability-stack/
    
  8. Navigate to the prometheus folder:

    cd prometheus
  9. Open VSCode within the repo folder:

    code .
    .
    ├── CODE_OF_CONDUCT.md
    ├── CONTRIBUTING.md
    ├── LICENSE
    ├── Makefile
    ├── README.md
    ├── SECURITY.md
    ├── alertmanager
    │   └── alertmanager.yml
    ├── docker-compose.yml
    ├── prometheus
    │   ├── alertrules.yml
    │   ├── prometheus.yml
    │   └── targets.json
    └── terraform-aws
     ├── README.md
     ├── modules
     │   ├── ec2
     │   │   ├── main.tf
     │   │   ├── outputs.tf
     │   │   ├── user-data.sh
     │   │   └── variables.tf
     │   └── security-group
     │       ├── main.tf
     │       ├── outputs.tf
     │       └── variables.tf
     ├── prometheus-stack
     │   ├── main.tf
     │   ├── outputs.tf
     │   └── variables.tf
     └── vars
         └── ec2.tfvars
    

    prometheus.yml

  10. Select file “prometheus.yml” to view/edit.

  11. For comparison, a full sample file is at:

    https://github.com/prometheus/prometheus/blob/release-2.49/config/testdata/conf.good.yml

    The section headings:

    global:
    rule_files:
    remote_write:
    remote_read:
    scrape_configs:
      - job_name: ...
    alerting:
    storage:
    tracing:
    

    Start Prometheus

  12. Navigate to your Prometheus folder. The server references configuration files named prometheus.yml

  13. At a folder containing prometheus.yml, run the executable installed by brew:

    prometheus

    Alternately, to run the Prometheus server using a custom config YAML file name:

    ./prometheus --config.file="prometheus-demo-a.yml"
    ts=2024-02-04T23:14:14.593Z caller=main.go:544 level=info msg="No time or size retention was set so using the default time retention" duration=15d
    ts=2024-02-04T23:14:14.593Z caller=main.go:588 level=info msg="Starting Prometheus Server" mode=server version="(version=2.49.1, branch=non-git, revision=non-git)"
    ts=2024-02-04T23:14:14.593Z caller=main.go:593 level=info build_context="(go=go1.21.6, platform=darwin/amd64, user=brew@Ventura, date=20240115-16:56:27, tags=netgo,builtinassets,stringlabels)"
    ts=2024-02-04T23:14:14.593Z caller=main.go:594 level=info host_details=(darwin)
    ts=2024-02-04T23:14:14.593Z caller=main.go:595 level=info fd_limits="(soft=524288, hard=unlimited)"
    ts=2024-02-04T23:14:14.593Z caller=main.go:596 level=info vm_limits="(soft=unlimited, hard=unlimited)"
    ts=2024-02-04T23:14:14.599Z caller=web.go:565 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
    ts=2024-02-04T23:14:14.600Z caller=main.go:1039 level=info msg="Starting TSDB ..."
    ts=2024-02-04T23:14:14.602Z caller=tls_config.go:274 level=info component=web msg="Listening on" address=[::]:9090
    ts=2024-02-04T23:14:14.602Z caller=tls_config.go:277 level=info component=web msg="TLS is disabled." http2=false address=[::]:9090
    ts=2024-02-04T23:14:14.606Z caller=head.go:606 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
    ts=2024-02-04T23:14:14.608Z caller=head.go:687 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=306.061µs
    ts=2024-02-04T23:14:14.608Z caller=head.go:695 level=info component=tsdb msg="Replaying WAL, this may take a while"
    ts=2024-02-04T23:14:14.612Z caller=head.go:766 level=info component=tsdb msg="WAL segment loaded" segment=0 maxSegment=0
    ts=2024-02-04T23:14:14.612Z caller=head.go:803 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=635.216µs wal_replay_duration=3.530458ms wbl_replay_duration=201ns total_replay_duration=4.53585ms
    ts=2024-02-04T23:14:14.617Z caller=main.go:1060 level=info fs_type=1a
    ts=2024-02-04T23:14:14.617Z caller=main.go:1063 level=info msg="TSDB started"
    ts=2024-02-04T23:14:14.617Z caller=main.go:1245 level=info msg="Loading configuration file" filename=prometheus.yml
    ts=2024-02-04T23:14:14.783Z caller=main.go:1282 level=info msg="Completed loading of configuration file" filename=prometheus.yml totalDuration=165.456586ms db_storage=6.275µs remote_storage=4.234µs web_handler=434ns query_engine=1.108µs scrape=162.131337ms scrape_sd=29.266µs notify=24.794µs notify_sd=7.148µs rules=2.483027ms tracing=14.689µs
    ts=2024-02-04T23:14:14.783Z caller=main.go:1024 level=info msg="Server is ready to receive web requests."
    ts=2024-02-04T23:14:14.783Z caller=manager.go:146 level=info component="rule manager" msg="Starting rule manager..."
    

    Pre-initialize

  14. Pre-initialize all important label combinations to default values when the program first starts (see the sketch below).

    NOTE: Julius says: When using metrics with label dimensions, the time series for any label combination will only appear in the /metrics output once that label combination has been accessed at least once. This can cause problems in PromQL queries that expect certain series to always be present.
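
    A minimal sketch of pre-initializing label combinations using the Go client library (the metric name, label, and paths here are hypothetical):

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    var httpRequests = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "http_requests_total", Help: "Total HTTP requests."},
        []string{"path"},
    )

    func main() {
        prometheus.MustRegister(httpRequests)
        // Touch each important label combination once at startup so each
        // series appears in /metrics with value 0 before it is ever incremented.
        for _, path := range []string{"/", "/api", "/healthz"} {
            httpRequests.WithLabelValues(path).Add(0)
        }
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":8080", nil)
    }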

    Debugging errors

  15. If you see errors such as this, it means Prometheus cannot reach an Alertmanager on port 9093 (expected if you have not started one):

    ts=2024-02-04T23:15:42.822Z caller=notifier.go:530 level=error component=notifier alertmanager=http://127.0.0.1:9093/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://127.0.0.1:9093/api/v2/alerts\": dial tcp 127.0.0.1:9093: connect: connection refused"
    ts=2024-02-04T23:16:57.812Z caller=notifier.go:530 level=error component=notifier alertmanager=http://127.0.0.1:9093/api/v2/alerts count=1 msg="Error sending alert" err="Post \"http://127.0.0.1:9093/api/v2/alerts\": dial tcp 127.0.0.1:9093: connect: connection refused"
    

Setup Go

My instructions on how to setup Golang

Configuration

  1. Identify your client machine’s IP address:

  2. Define storage location:

    sudo mkdir -p /etc/prometheus
  3. Estimate space usage:

    disk space = retention time * samples ingested per second * bytes per sample
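
    For example, a rough sketch assuming the default 15-day retention, a hypothetical ingestion rate of 100,000 samples per second, and roughly 2 bytes per sample after compression:

    15 days * 86,400 s/day * 100,000 samples/s * 2 bytes/sample = about 260 GB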

  4. Construct the server's startup flags:

    Define where to store the TSDB:

    --storage.tsdb.path

    Define removal of data after this length of time:

    --storage.tsdb.retention.time

    Define removal of data after amount of space used (like in Splunk):

    --storage.tsdb.retention.size
  5. VIDEO: Verify release

    curl -s localhost:9090/api/v1/status/buildinfo | python3 -m json.tool
    {
    "status": "success",
    "data": {
       "version": "2.28.1",
       "revision": "d039c3e1c",
       "branch": "HEAD",
       "buildUser": "root@fa123cd1234",
       "buildDate": "2021-05-26T14:28:09Z",
       "goVersion": "go1.16.5"
    }
    }
    

    Upgrade data

    To ensure that data is not lost, the upgrade process is to stop the old Prometheus server, install the new version, and then start the new version.

  6. PROTIP: Setup a symlink pointing to the physical location of the binary using the ln (link) command (target first, then the link name):

    sudo ln -s /usr/local/bin/prometheus-2.2.0.linux-amd64/prometheus /usr/local/bin/prometheus
    

    Some prefer this to renaming the downloaded file so that multiple versions can reside on the server in case response to a vulnerability requires falling back to a previous version.

  7. Define a special port for versions under test:

    ./prometheus --web.listen-address localhost:9091
  8. Stop any running Prometheus server gracefully:

    systemctl stop prometheus
  9. Run:

    ./prometheus --config.file prometheus.yml

scrape_configs

On the Prometheus server, edit the prometheus.yml file to add a new scrape_configs section to recognize the exporter on host at 172.31.122.23:

scrape_configs:
  - job_name: "prometheus"
    metrics_path: "/metrics"
    static_configs:
    - targets:
      - "localhost:9090"
  - job_name: node
    file_sd_configs:
    - files:
      - "/etc/prometheus/file_sd/node.yml"
  - job_name: 'node_exporter'
    static_configs:
    - targets: ['172.31.122.23:9100']

Restart the Prometheus server to pick up the new configuration.
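
Alternately, a full restart is not strictly necessary: Prometheus re-reads its configuration on a SIGHUP signal, or (when started with --web.enable-lifecycle) on an HTTP POST to its reload endpoint. A sketch:

kill -HUP $(pgrep prometheus)
# or, with --web.enable-lifecycle:
curl -X POST http://localhost:9090/-/reload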


Docker

  1. Confirm the creation of the existing Docker image:

    docker image list

    The response lists “forethought” as a Docker image.

  2. List the contents of the forethought directory and subdirectories:

    ls -R forethought
  3. Deploy the web application to a container. Map port 8080 on the container to port 80 on the host:

    docker run --name ft-app -p 80:8080 -d forethought

  4. Check that the application is working correctly by visiting the server’s provided URL.

    In the script, this is done using a curl script and examining the HTML response.

  5. Install

    NOTE: The Terminal is inside a Dockerized Ubuntu (18.04 Bionic Beaver LTS) image. So apt-get commands are used to install Prometheus, Alertmanager, and Grafana.

    The infrastructure is monitored using Prometheus's Node Exporter to view statistics about CPU, memory, disk, file system, basic networking, and load metrics. Also monitored are containers being used on virtual machines.

    Once infrastructure monitoring is up and running, the basic Node.js application uses a Prometheus client library to track metrics across the app.

    Finally, add recording and alerting rules, and build out a series of routes so any alerts created get to their desired endpoint.

    The course also looks at creating persistent dashboards with Grafana and using its various graphing options to better track data.


Linux Metrics

https://training.promlabs.com/training/monitoring-linux-host-metrics


Monitoring Kubernetes

Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016 as its second hosted project after Kubernetes. So naturally, Prometheus works with K8s.

See https://github.com/coreos/prometheus-operator and https://devops.college/prometheus-operator-how-to-monitor-an-external-service-3cb6ac8d5acb

PROTIP: OS package managers (apt-get, yum, brew) often lag behind official Prometheus releases, so installer packages are not recommended for production use. But that hasn't stopped many from using them in production.

$ cd /tmp
$ wget https://github.com/prometheus/prometheus/releases/download/v2.2.0/prometheus-2.2.0.linux-amd64.tar.gz
$ tar -xzf prometheus-2.2.0.linux-amd64.tar.gz
 
$ sudo chmod +x prometheus-2.2.0.linux-amd64/{prometheus,promtool} 
$ sudo cp prometheus-2.2.0.linux-amd64/{prometheus,promtool} /usr/local/bin/
$ sudo chown root:root /usr/local/bin/{prometheus,promtool}
 
$ sudo mkdir -p /etc/prometheus
$ sudo vim /etc/prometheus/prometheus.yml
$ promtool check config prometheus.yml
 
Checking prometheus.yml
SUCCESS: 0 rule files found
 
$ prometheus --config.file "/etc/prometheus/prometheus.yml" &

Prometheus on Kubernetes

As with most things Kubernetes, a Helm chart can do it all:

  • VIDEO: Setup Prometheus Monitoring on Kubernetes using Helm and Prometheus Operator | Part 1 - by TechWorld with Nana 2020

https://gitlab.com/nanuchi/youtube-tutorial-series/-/blob/master/prometheus-exporter/install-prometheus-commands.md
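
For reference, a minimal install of the community kube-prometheus-stack Helm chart looks roughly like this (the release name and namespace are arbitrary):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace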

Starting Prometheus in Docker

To run Prometheus after downloading the Docker image from the “prom” account in Dockerhub:

docker run -p 9090:9090 -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

Start Docker and try again if you get this error message:

docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.

The expected message is:

msg="Server is ready to receive web requests."

The location of the database and the retention period are controlled by command line options: Add --storage.tsdb.path for another path. Add --storage.tsdb.retention.time to specify another retention period than the default 15d (days).

  1. Open a browser to see the log at:

    http://localhost:9090/metrics

    There is no menu item to view the page.

  2. Open a browser to see the Graph at the URL home page:

    http://localhost:9090

    prometheus-graph-menu-403x380-51898.jpg

    The above example shows metrics for the Go language runtime running locally.

    NOTE: https://prometheus.io/docs contains docs. It says in 2012 SoundCloud wrote Prometheus in Golang and open sourced it at https://github.com/prometheus.


Graphing specs

  1. TODO: Select “go_gc_duration_seconds” for the median, which is 50th quantile, specified as:

    rate(prometheus_tsdb_head_samples_appended_total[1m])

    Also:

    go_gc_duration_seconds{instance="localhost:9090",job="prometheus",quantile="0.5"}

    See https://prometheus.io/docs/prometheus/latest/storage

  2. Press Execute.
  3. Click “Graph”.

    Notice the granularity of timing on the horizontal axis: thousandths of a second.

    Configuring Prometheus.yml

  4. Open a browser to

    http://localhost:9090/config

    prometheus.yml is the configuration file that contains these blocks: global, rule_files, and scrape_configs. Optionally, there are remote_read, remote_write, alerting.

    global:
      evaluation_interval: 15s
      scrape_interval: 15s
      scrape_timeout: 10s
      external_labels:
        environment: localhost.localdomain
    

    In the global block, scrape_interval specifies the frequency (here 15s) at which Prometheus scrapes targets. (The default is every 60 seconds.)

    The /targets page shows the most recent scrape error from among all targets that cannot be scraped.

    The default evaluation_interval of 15s controls how often Prometheus evaluates rule files that specify the creation of new time series and generation of alerts.

    A distinguishing feature is its rules engine, which enables alerts sent through the Prometheus Alertmanager, installed separately.

    Rules Files

  5. Open a browser to

    http://localhost:9090/rules

    Here are both recording rules and alerting rules.

    Recording Rules

    TUTORIAL:

    In the rule_files section are yml file names. Each yml file contains groups of recording rules that define how to periodically precompute query results as new series (with a new name) in the TSDB.

    https://prometheus.io/docs/practices/rules/

    Each record: in a rule is executed in the sequence defined. Each record can, optionally, have key : value labels.

    The record’s expr (expression) defines how to calculate the metric.

    The naming convention for record names has three parts:

    aggregation level : original metric name : aggregation type (dimensional, type of rate, etc.)

    Examples:

    - record: path:http_requests:rate5m
      expr: sum without(instance) (rate(http_requests_total{job="job1"}[5m]))
    
    • path specifies that the metric is broken up by the path label.
    • rate5m specifies a 5-minute rate.

    - record: instance_mode:node_cpu:rate1m
      expr: sum without(cpu) (rate(node_cpu_seconds_total{job="node"}[1m]))
      labels:
        my_label: my_value
    
    • “instance_mode:node_cpu:rate1m”

    QUESTION: sum without(status, instance, cpu)

    Referencing a precomputed series is less resource-intensive than repeatedly making the same calculations for the same query. Instead of multiple dashboards each re-evaluating a query, recording rules can also scrape metrics from one TSDB into another (to aggregate), that is, to federate (share) the result of a PromQL expression.

    To produce alerts with a custom threshold for each path’s current error rate, use time series as alert thresholds. Define recording rules with the same output metric name, but different path label values. This enables different error rate thresholds to be synthetically recorded for each path of an HTTP-serving application. Then a single alerting rule references the generated threshold metric name.
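
    A hedged YAML sketch of that pattern (the metric names, paths, and threshold values here are all hypothetical):

    groups:
    - name: thresholds
      rules:
      - record: path:error_rate:threshold
        expr: 0.5
        labels:
          path: /api
      - record: path:error_rate:threshold
        expr: 5
        labels:
          path: /login
    - name: alerts
      rules:
      - alert: HighErrorRatePerPath
        # group_left joins each path's error-rate series to its per-path threshold series.
        expr: path:http_errors:rate5m > on(path) group_left path:error_rate:threshold
        for: 5m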

    The default interval is 1 minute. That can be changed by the global.evaluation_interval field in the Prometheus configuration file or by the rule-group-specific interval override.

    According to Julius, during each rule evaluation cycle, Prometheus will:

    1. Evaluate the rule’s PromQL expression as an instant query at the server’s current timestamp,
    2. Apply label overrides (if any) to the output series,
    3. Apply sample output limits (if any) and fail the evaluation if the limit is exceeded,
    4. Store the resulting time series in the local TSDB under the provided new metric name and at the evaluation timestamp.

    alertmanager server

    TRAINING:

    The Prometheus Alertmanager is a separate utility server that sends alerts to various endpoints when PromQL expressions in Prometheus determine that each alert needs to be sent.

    REMEMBER: Alert Rules are read by the Prometheus server (not the Alertmanager).

    An Alertmanager can receive alerts from several Prometheus servers. NOTE: Under development: A cluster of Alertmanager instances form a mesh configuration to ensure High Availability.

    With Prometheus, alerting, dashboarding, and ad-hoc querying all act on the same TSDB.

    alertmanager install

    I've automated the install with this Bash script. The latest version is automatically identified and the binary renamed to just "alertmanager".

    run.sh
    

    alertmanager config

    In the prometheus.yml file, define the Alertmanager server(s) to which Prometheus sends alerts:

    alerting:
      alertmanagers:
      - scheme: https
        static_configs:
        - targets:
          - "localhost:9093"
    

    REMEMBER: Alertmanager uses port 9093 by default.

    Docker setup

    Route defaults (see the sketch below):

    • group_wait: 30s = how long to buffer alerts in the same group before sending the initial notification
    • group_interval: 5m = how long to wait before sending notifications about new alerts added to a group already notified
    • repeat_interval: 4h = how long to wait before re-sending a notification for the same alert
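
    A minimal Alertmanager routing sketch with those defaults spelled out (the receiver name here is hypothetical):

    route:
      receiver: default
      group_by: ['alertname', 'job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
    - name: default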

    Alerting Rules

    Based on samples from Julius and Craig:

    expr (expression) is the condition which triggers the alert.

    groups:
    - name: demo-alerts
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Instance is down
      - alert: HighErrorRate      # The name of the 2nd alert.
        expr: |                   # A PromQL expression whose output series become alerts.
          (
            sum by(path, instance, job) (
              rate(demo_api_request_duration_seconds_count{status=~"5..",job="demo"}[1m])
            )
            /
            sum by(path, instance, job) (
              rate(demo_api_request_duration_seconds_count{job="demo"}[1m])
            ) * 100 > 0.5
          )
        for: 5m                   # How long each result time series needs to be present to become a firing alert.
        labels:                   # Extra labels to attach for routing.
          severity: critical
        annotations:              # Non-identifying annotations that can be used in Alertmanager notifications.
          title: " high 5xx rate on "
          description: "The 5xx error rate for path  on  is %."
    

    The 2nd alerting rule generates critical alerts for any path/instance/job label combinations that have an error rate larger than 0.5% for at least 5 minutes.

    A different label can specify a different app/team.

    • deduplication (throttling and repetition)

    inhibit alerts about servers downstream if network is down.

    inhibit_rules:
    - source_match:
        service: 'network'
      target_match:
        service: 'servers'
    
    • silencing

    QUESTION: Instead of the IP address, specify a server name or code to look up metadata, using an info: entry under annotations with a Go template such as

    /youpath/alertmanager/templates/custom.tmpl

    QUESTION: Insert and remove server configs automatically as they are created/destroyed automatically.

    severity: critical or warning

    Alerts are routed to an endpoint based on the label sets defined.

    https://prometheus.io/docs/alerting/latest/configuration/#receiver: "We are not actively adding new receivers. We recommend implementing custom notification integrations via the webhook receiver."

    Endpoints Receivers include:

    • webhook [ tutorial] TODO: to Twilio to send mobile phone SMS internationally.
    • email [ tutorial] TODO: Get credentials from secrets manager instead of hard-coding auth_password
    • hipchat
    • pagerduty
    • pushover
    • slack [ tutorial]
    • opsgenie
    • victorops
    • zulip.com chat
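
    For the recommended webhook receiver, a minimal alertmanager.yml sketch looks roughly like this (the URL is a hypothetical local endpoint):

    route:
      receiver: my-webhook
    receivers:
    - name: my-webhook
      webhook_configs:
      - url: 'http://localhost:5001/alert-hook'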

    amtool

    Alertmanager doesn’t start if default route is not specified.

    ./amtool check-config alertmanager.yml
    

    Scrape configs

    This defines the job that scrapes the Prometheus web UI:

    scrape_configs:
      - job_name: 'prometheus'
        metrics_path: "/metrics"
        static_configs:
        - targets: ['localhost:9090']
      - job_name: node
        file_sd_configs:
        - files:
          - "/etc/prometheus/file_sd/node.yml"
    

    There can be several jobs named in a config, named x, y, and z in the sample config file.

    Silence

  6. When working on an issue, silence notifications.

  7. When fixed, expire the silence.

  8. Previously silenced alerts can be reactivated for a period of time.

Local start

Alternately,

PROTIP: Using /etc/prometheus would require sudo, but ~/.prometheus would not.

  1. Create a folder to hold the Prometheus configuration file, then CD to it:

    cd ~ ; mkdir .prometheus ; cd .prometheus
  2. Create a Prometheus configuration file in the folder or copy in then edit a full template example at:

    https://github.com/prometheus/prometheus/blob/release-2.3/config/testdata/conf.good.yml

  3. Validate yaml syntax online:

    https://github.com/adrienverge/yamllint

  4. Validate for content using the promtool in the Prometheus bin folder:

    promtool check config prometheus.yml

    An example error message is:

    Checking prometheus.yml
      FAILED: parsing YAML file prometheus.yml: yaml: line 13: did not find expected '-' indicator
    

    The expected response is: “SUCCESS: 0 rule files found”.

  5. To run Prometheus locally in the directory containing the Prometheus binary:

    ./prometheus --config.file=prometheus.yml
    

    The server then outputs log lines such as:

    level=info ts=2017-10-23T14:03:02.274562Z caller=main.go:216 msg="Starting prometheus"...

Ansible installer

Paweł Krupa (@paulfantom, author of the Docker Workshop) and Roman Demachkovych (@rdemachkovych), together as Cloud Alchemy, defined a presentation about their Ansible role for Prometheus, with https://demo.cloudalchemy.org.

  • Zero-configuration deployment
  • Easy management of multiple nodes
  • Error checking
  • Multiple CPU architecture support

  • versioning
  • system user management
  • CPU architecture auto-detection
  • systemd service files
  • Linux capabilities support
  • basic SELinux (Security-Enhanced Linux) security module support

https://travis-ci.org/cloudalchemy/demo-site

Command

# Ansible managed file. Be wary of possible overwrites.
[Unit]
Description=Prometheus
After=network.target
 
[Service]
Type=simple
Environment="GOMAXPROCS=1"
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention=30d \
  --web.listen-address=0.0.0.0:9090 \
  --web.external-url=http://demo.cloudalchemy.org:9090
 
SyslogIdentifier=prometheus
Restart=always
 
[Install]
WantedBy=multi-user.target

App Metrics

The four golden signals of monitoring begin with:

  • Latency

    The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.

  • Traffic

    A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.

    To identify bottlenecks, instead of beginning with given metrics (partial answers) and trying to work backwards, the Utilization Saturation and Errors (USE) Method by Brendan Gregg (of Netflix), described at http://www.brendangregg.com/usemethod.html, begins by posing questions from a checklist and then seeks answers. For server analysis, such a checklist can be used to quickly identify resource bottlenecks or errors.

  • Utilization

    the average time that the resource was busy servicing work.

  • Errors

    The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, “If you committed to one-second response times, any request over one second is an error”). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.

  • Saturation

    How “full” your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential. In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., “Give me a nonce” or “I need a globally unique monotonic integer”) that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.

Predictive: saturation is the basis for projections of impending issues, such as “at the current rate, your database will fill its hard drive in 4 hours.”

graph

VIDEO: Type “prometheus” for the auto-complete to show the default raw metrics built-in:

  • prometheus_api_remote_read_queries
  • prometheus_build_info
  • prometheus_config_last_reload_success_timestamp_seconds
  • prometheus_config_last_reload_successful
  • prometheus_engine_queries
  • prometheus_engine_queries_concurrent_max
  • prometheus_engine_query_duration_seconds
  • prometheus_engine_query_duration_seconds_count
  • prometheus_engine_query_duration_seconds_sum
  • prometheus_engine_query_log_enabled
  • prometheus_engine_query_log_failures_total
  • prometheus_http_request_duration_seconds_bucket
  • prometheus_http_request_duration_seconds_count
  • prometheus_http_request_duration_seconds_sum
  • prometheus_http_requests_total
  • prometheus_http_response_size_bytes_bucket


App Instrumentation

“Instrumentation” is the process of adding code to your application so it exposes metrics to Prometheus.

Prometheus scrapes metrics from apps set up to expose exposition-format metrics through a /metrics path.

PROTIP: For insight over what is happening inside your application or library: Julius recommends direct instrumentation: track and expose Prometheus metrics directly from your own applications and services rather than using exporters for code.

  • https://www.tigera.io/learn/guides/prometheus-monitoring/prometheus-metrics/

  • https://prometheus.io/docs/practices/instrumentation/

  • https://prometheus.io/docs/instrumenting/exporters/

Video course “Instrumenting Applications with Metrics for Prometheus” by Elton Stoneman (sixeyed.com) and https://observability.courselabs.co/

The course makes use of Docker containers created for its labs:

  1. Install Docker and Prometheus.
  2. Download the files. Navigate to folder 02
  3. Start the servers:

    docker-compose up -d

  4. Logs from the web app:

    docker-compose logs web

    Client libraries

  5. App code needs to be configured with a Prometheus client library. Official and unofficial (third-party) client libraries for various languages are listed at:

    https://prometheus.io/docs/instrumenting/clientlibs/

  6. C# .NET programs need to be configured with client libraries from:

    https://www.nuget.org/packages?q=prometheus

    The C# .NET .csproj file would contain:

    <ItemGroup>
      <PackageReference Include="prometheus-net.AspNetCore" Version="3.6.0" />
    </ItemGroup>
    
  7. Julius has a tutorial about instrumenting his sample Go app conveniently running in GitPod:

    git clone https://github.com/promlabs/go-instrumentation-exercise

    https://pkg.go.dev/github.com/prometheus/client_golang/prometheus

    https://pkg.go.dev/github.com/prometheus/client_golang/prometheus/promhttp

    https://pkg.go.dev/github.com/prometheus/client_golang/prometheus/promauto

    Initialize the Go module configuration:

    go mod init instrumentation-examples

    His solution branch of the repository:
    git checkout instrumented

  8. View file docker-compose-prometheus.yaml

    version: '3.7'
    services:
      prometheus:
        image: psodpromapps/prometheus:m2
        build:
          context: prometheus
        ports:
          - "9090:9090"
        networks:
          - wb-net
  9. The latest version:

    QUESTION

  10. Start Prometheus

    docker-compose -f docker-compose.yaml \
      -f docker-compose-prometheus.yaml up -d

The “bag” of metrics that should all be exposed together as part of a scrape is called a “metrics registry”.
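
As a minimal sketch of such direct instrumentation in Go (using the client_golang packages linked above; the metric names and port here are illustrative, not taken from the course), promauto registers each metric with the default registry, and promhttp serves that whole registry on /metrics:

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promauto"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // promauto registers these metrics with the default registry at startup.
    var (
        requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
            Name: "app_http_requests_total", // hypothetical metric name
            Help: "Total HTTP requests handled, by path.",
        }, []string{"path"})

        queueLength = promauto.NewGauge(prometheus.GaugeOpts{
            Name: "app_worker_queue_length", // hypothetical metric name
            Help: "Current number of pending jobs.",
        })
    )

    func handler(w http.ResponseWriter, r *http.Request) {
        requestsTotal.WithLabelValues(r.URL.Path).Inc() // count every request
        w.Write([]byte("ok"))
    }

    func main() {
        queueLength.Set(0)
        http.HandleFunc("/", handler)
        // Expose the whole default metrics registry for Prometheus to scrape.
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":8080", nil)
    }

Browsing to http://localhost:8080/metrics should then show these two custom metrics alongside the Go runtime metrics that the default registry exposes automatically.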

Metrics exposition

Metrics are made available from a target’s Node Exporter by exposing an unencrypted (HTTP) URL such as http://localhost:9100/metrics

Blank lines have been added between the metrics shown below for clarity.

Each metric is preceded by comments starting with “#”.

The metrics exposition format (shown below) has evolved into the OpenMetrics CNCF open standard defined at https://github.com/OpenObservability/OpenMetrics, with discussions at https://groups.google.com/g/openmetrics

# HELP node_uname_info Info from the uname system call
# TYPE node_uname_info gauge
node_uname_info{domainname="(none)",machine="x86_64",nodename="localhost.localdomain",release="4.15.0-20-generic",sysname="Linux",version="#21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018"} 1
 
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 32
 
# HELP http_request_duration_microseconds The HTTP request latencies in microseconds.
# TYPE http_request_duration_microseconds summary
http_request_duration_microseconds{handler="prometheus",quantile="0.5"} 73334.095
 
# HELP dotnet_total_memory_bytes Total known allocated memory
# TYPE dotnet_total_memory_bytes gauge
dotnet_total_memory_bytes 363222
 
# HELP dotnet_collection_count_total GC collection count
# TYPE dotnet_collection_count_total counter
dotnet_collection_count_total{generation="0"} 0
dotnet_collection_count_total{generation="1"} 0
dotnet_collection_count_total{generation="2"} 0
 
# HELP node_filefd_allocated File descriptor statistics: allocated.
# TYPE node_filefd_allocated gauge
node_filefd_allocated 1184
 
# HELP node_disk_io_time_seconds_total Total seconds spent doing I/Os.
# TYPE node_disk_io_time_seconds_total counter
node_disk_io_time_seconds_total{device="sda"} 104.296
 
# HELP node_disk_io_now The number of I/Os currently in progress.
# TYPE node_disk_io_now gauge
node_disk_io_now{device="sda"} 0
 
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.048576e+06
 
# HELP node_disk_io_time_weighted_seconds_total The weighted # of seconds spent doing I/Os.
# TYPE node_disk_io_time_weighted_seconds_total counter
node_disk_io_time_weighted_seconds_total{device="sda"} 104.296
 
# HELP worker_queue_length The length of the queue of pending requests.
# TYPE worker_queue_length gauge
worker_queue_length 0
 
# HELP worker_jobs_total Worker jobs handled
# TYPE worker_jobs_total counter
worker_jobs_total{status="processed"} 1570222
worker_jobs_total{status="failed"} 122
 
# HELP worker_jobs_active Worker jobs in process
# TYPE worker_jobs_active gauge
worker_jobs_active 10
 
# HELP process_open_handles Number of open handles
# TYPE process_open_handles gauge
process_open_handles 10
 
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.01
   
  • Timestamps are 64-bit integers with millisecond precision (thousandths of a second), NOT nanosecond.
  • Sample values are 64-bit floating-point numbers (allowing integer precision up to 2^53). Newer Prometheus versions can additionally store native histogram sample values.

  • To colorize metrics output in browsers, install the “Prometheus Formatter” extension for Chrome and Firefox from https://github.com/fhemberger/prometheus-formatter created by fhemberger.

https://prometheus.io/docs/instrumenting/exposition_formats/


Exporters

Exporters are installed on servers to translate existing metrics into a format that Prometheus can scrape. Stock exporters are provided at: https://prometheus.io/download/#prometheus

PROTIP: Exporters are commonly identified by the default port number each one listens on.

Pre-defined third-party exporters and software exposing Prometheus metrics are listed at
https://prometheus.io/docs/instrumenting/exporters

A whole list of 3rd-party exporters:
https://prometheus.io/docs/instrumenting/exporters/#third-party-exporters

Some third-party software components or devices cannot be instrumented directly with Prometheus-compatible metrics; an exporter running alongside them translates their native metrics instead.

Best Practices

Metric naming

Label usage

MySQL setup

  1. For the MySQL prompt:

    mysql -u root -p
  2. Create a database user for the exporter to use.
  3. Grant permissions
  4. Provide credentials to exporter

CREATE USER 'mysqld_exporter'@'localhost' IDENTIFIED BY 'password' WITH MAX_USER_CONNECTIONS 3; 
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'mysqld_exporter'@'localhost';
FLUSH PRIVILEGES; 
export DATA_SOURCE_NAME='mysqld_exporter:password@(localhost:3306)/';
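
With DATA_SOURCE_NAME exported as above, the exporter can then be started and checked; a sketch assuming the mysqld_exporter binary is in the current directory and listening on its default port 9104:

./mysqld_exporter &
curl http://localhost:9104/metrics | head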

https://github.com/orgs/prometheus/projects/8 is used to track work on new exporters.

https://training.promlabs.com/training/understanding-and-building-exporters/ covers creating exporters for more systems.

WMI exporter on Windows

The WMI Exporter provides system metrics for Windows servers.

Setup Node Exporter

Despite the name, the Prometheus Node Exporter is unrelated to Node.js (the JavaScript runtime): it exposes hardware and OS metrics from *NIX hosts.

The Node Exporter has its own repo. To download a release from GitHub:

https://github.com/prometheus/node_exporter/releases

# TODO: Identify latest version URL in https://prometheus.io/download/#node_exporter
# TODO: Code different downloads for Darwin vs. other OS:
wget https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz
   # https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.darwin-386.tar.gz
   # v0.16.0 is dated 2018-05-15
tar -xzf node_exporter-*
sudo cp node_exporter-*/node_exporter /usr/local/bin/
node_exporter --version

A sample response (at time of writing):

node_exporter, version 0.16.0 (branch: HEAD, revision: 6e2053c557f96efb63aef3691f15335a70baaffd)
. . .

The node_exporter listens, by default, on port 9100 to expose metrics, but this can be changed:

node_exporter --web.listen-address=":9100" \
   --web.telemetry-path="/node_metrics"
   
# Get the exporter tar (substitute the latest version from the releases page; v0.16.0 shown here):
wget https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz
# unpack:
tar xvfz node_exporter-*linux-amd64.tar.gz
# Move into folder:
cd node_exporter-*linux-amd64
# Get menu of parms and defaults:
./node_exporter --help
# Start the exporter in the background:
./node_exporter --web.listen-address=":9100" > node.out 2>&1 &
# Check that metrics are publishing (from this or a client machine):
curl http://localhost:9100/metrics
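
To have Prometheus actually scrape the exporter, add a job to the scrape_configs section of prometheus.yml; a minimal sketch, assuming the exporter host is reachable as localhost from the Prometheus server (the job name and interval are illustrative):

scrape_configs:
  - job_name: "node"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:9100"]

After reloading Prometheus, the new target should appear under Status > Targets in the Prometheus UI.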

AWS

PROTIP: In AWS S3, follow your organization’s Least-Privilege security policies by not allowing a single account all Actions for Prometheus:

    "Action": [ "s3:PutObject", "s3:GettObject", "s3:PutObjectAcl", "s3:DeleteObject", "s3:ListBucket", "s3:GetBucketLocation" ]

Monitoring others

systems, APIs, logging, storage, hardware related, HTTP, etc.


Creating Exporters

https://prometheus.io/docs/instrumenting/writing_exporters/


Metric types

VIDEO by Julius Volz. See also Julius’ blog.

  • Counters increment

    VIDEO: Counters track totals that only ever go up, such as the total number of samples Prometheus has ingested so far. PROTIP: A metric that keeps going up and up in a graph eventually becomes unreadable, so it is better reported as a rate over a time window. To derive the 90th-percentile value from a histogram (whose buckets are themselves counters):

    histogram_quantile(0.9, sum by(le, path) (rate(demo_api_request_duration_seconds_bucket[5m])))
  • Gauges tally

    Gauges tally values that naturally go up and down, such as temperature, the amount of gasoline in a vehicle, the capacity of memory and CPU used, or the length of a queue of items awaiting processing. Each gauge has a maximum capacity and usually a minimum of zero.

    An exposition format:

    # HELP process_open_fds Number of open file descriptors.
    # TYPE process_open_fds gauge
    process_open_fds 15
     
  • Histogram count of value buckets

    Histograms report a count of how many observations fell within each bucket of a distribution. As the client library receives readings such as request duration (latency), each observation is counted into the appropriate bucket of a histogram object set up for that metric.

    When it comes time to publish, the counts are exposed per bucket. If 6 buckets are used, the exposition would have 6 bucket lines (plus the _sum and _count series).

    # HELP http_request_duration_seconds A histogram of the request duration.
    # TYPE http_request_duration_seconds histogram
    http_request_duration_seconds_bucket{le="0.025"} 20
    http_request_duration_seconds_bucket{le="0.05"} 60
    http_request_duration_seconds_bucket{le="0.1"} 90
    http_request_duration_seconds_bucket{le="0.25"} 100
    http_request_duration_seconds_bucket{le="+Inf"} 105
    http_request_duration_seconds_sum 21.322
    http_request_duration_seconds_count 105
     

    Bucket boundaries are chosen when the application is instrumented (for example, the client library’s default latency buckets), not by the values observed at runtime. Each bucket is cumulative: its “le” (less than or equal to) label is the bucket’s inclusive upper bound on the observed value, so each bucket’s count includes everything counted by the buckets below it. The special “+Inf” bucket counts every observation and therefore always equals the _count series.

    Histograms make outliers visible when request durations (latencies) are too slow: alerts can trigger when too large a share of observations falls above a given bucket boundary. (A PromQL sketch of the related average-duration calculation appears after this list of metric types.)

    Julius notes that when aggregating across dimensions (such as endpoint, HTTP method, etc.) use histogram rather than summary. He explains why in his tutorial.

  • Summary quantiles

    These are sometimes loosely called “quartiles”, which is a misnomer because they do not mean quarters.

    Quantiles in Prometheus summary metric types are percentiles specified on a 0-to-1 scale rather than the 0-to-100 scale normally used in the industry (so quantile="0.99" is the 99th percentile).

    # HELP rpc_duration_seconds A summary of RPC durations in seconds.
    # TYPE rpc_duration_seconds summary
    rpc_duration_seconds{quantile="0.01"} 3.102
    rpc_duration_seconds{quantile="0.05"} 3.272
    rpc_duration_seconds{quantile="0.5"} 4.773
    rpc_duration_seconds{quantile="0.9"} 9.001
    rpc_duration_seconds{quantile="0.99"} 76.656
    rpc_duration_seconds_sum 5.7560473e+04
    rpc_duration_seconds_count 2693
     

    A very low value at the lowest quantile may indicate durations that are suspiciously short, such as when an error occurred.

    To create a summary, this code from Julius specifies the quantile objectives:

    requestDurations := prometheus.NewSummary(prometheus.SummaryOpts{
      Name:       "http_request_duration_seconds",
      Help:       "A summary of the HTTP request durations in seconds.",
      Objectives: map[float64]float64{
        0.5: 0.05,   // 50th percentile with a max. absolute error of 0.05.
        0.9: 0.01,   // 90th percentile with a max. absolute error of 0.01.
        0.99: 0.001, // 99th percentile with a max. absolute error of 0.001.
      },
    },
    )
     

    https://prometheus.io/docs/practices/histograms/
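
A related pattern worth remembering (a sketch reusing the http_request_duration_seconds series from the exposition example above): the average request duration over a window is the rate of the _sum series divided by the rate of the _count series:

    rate(http_request_duration_seconds_sum[5m])
    /
    rate(http_request_duration_seconds_count[5m])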


Operator

TBD


PromQL Query Language

  • https://training.promlabs.com/training/understanding-promql

Prometheus provides multiple modes of graphing and dashboarding support, but also exposes its time-series data to API clients such as Grafana, which issue PromQL (Prometheus Query Language) queries to extract data for display in visualizations on their websites. Example queries for several of the topics below appear after the list.

  • Selecting Data
  • Rates and Derivatives
  • Aggregating over time
  • Aggregating over dimensions
  • Binary operators
  • Histograms
  • Timestamp Metrics
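
A few example one-liners for those operations (the metric and label names are illustrative, not from any particular target):

    # Selecting data: series of one metric, filtered by label
    http_requests_total{job="api", status!="500"}

    # Rates: per-second increase of a counter over the last 5 minutes
    rate(http_requests_total[5m])

    # Aggregating over dimensions: total request rate per path
    sum by(path) (rate(http_requests_total[5m]))

    # Aggregating over time: peak queue length over the last hour
    max_over_time(worker_queue_length[1h])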

  1. PROTIP: REMEMBER:

    https://promlabs.com/promql-cheat-sheet/

    • Selecting series
    • Rates of increase for counters
    • Aggregating over multiple series

    • Math between series
    • Filtering series by value
    • Set operations

    • Quantiles from histograms
    • Changes in gauges
    • Aggregating over time

    • Time
    • Dealing with missing data
    • Manipulating labels

    • Subqueries

  2. Sample alert rule for when the per-path rate of 500 errors rises above a percentage of all requests:

    alert: Many500Errors
    # This is the PromQL expression that forms the "heart" of the alerting rule.
    expr: |
      (
          sum by(path) (rate(http_requests_total{status="500"}[5m]))
        /
          sum by(path) (rate(http_requests_total[5m]))
      ) * 100 > 5
    for: 5m
    labels:
      severity: "critical"
    annotations:
      summary: "Many 500 errors for path {{ $labels.path }} ({{ $value }}%)"
    
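Before loading a rule file like this into Prometheus, it can be validated with promtool (shipped alongside the Prometheus server binary); a sketch assuming the rule is wrapped in a rule group and saved as alert-rules.yml (a hypothetical filename):

    promtool check rules alert-rules.yml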

Types of metrics exposed to Prometheus:

  1. Counter - an ever-increasing value (such as packets received)
  2. Gauge - a current value that increases or decreases (such as memory usage)
  3. Histogram - counts of observations falling into configured value buckets
  4. Summary - pre-computed quantiles plus a running sum and count of observations.
For example, the 90th-percentile request duration, aggregated across all status codes and instances:

histogram_quantile(
  0.90,
  sum without(code, instance) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

Alerting Rules

https://prometheus.io/docs/practices/alerting/

https://prometheus.io/docs/alerting/latest/alertmanager/

https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

Resources

“Monitoring with Prometheus” is 360 pages at https://prometheusbook.com is by James Turnbull, who also wrote books about other DevOps tools: Kubernetes, Packer, Terraform, Logstash, Puppet, etc. based on his work as CTO at Kickstarter, VP of Services and Support at Docker, VP of Engineering at Venmo, and VP of Technical Operations at Puppet. The book is hands-on for Prometheus version 2.3.0 (build date 20171006-22:16:15) on a Linux distribution. However, the author promises updates even though he is busy as CTO at Empatico. Code for the book is at:

Turnbull suggests monitoring for “correctness”, not just their status, starting with business metrics, then application (https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html#xref_monitoring_golden-signals), then operating system metrics to avoid “cargo cult” delusions. An example is monitoring for rates of business transactions rather than server uptime.

Brian Brazil blogs about Prometheus at https://www.robustperception.io/blog/ The blog mentions his trainings. He wrote the O’Reilly book “Prometheus: Up & Running”.

paulfantom/workshop-docker

Monitoring, the Prometheus Way May 8, 2017 by Julius Volz - Co-Founder, Prometheus

Infrastructure and application monitoring using Prometheus at Devox UK May 17, 2017 by Marco Pas

LinuxAcademy video hands-on courses:

  • Monitoring Infrastructure and Containers with Prometheus: Prometheus is used to monitor infrastructure and applications at multiple levels: on the host itself, on any containers, and on the application. This hands-on lab addresses monitoring of virtual machine host and containers. It begins by setting up monitoring for a virtual machine using Prometheus’s Node Exporter. Then set up container monitoring for the provided container using Google’s cAdvisor.

    View metrics in Prometheus across two levels of a system to track changes and view trends.

  • DevOps Monitoring Deep Dive by Elle Krout references an interactive Lucid diagram called “ProjectForethought” for the simple Node.js to-do list program called Forethought that is the subject of monitoring.

    Within Linux Academy’s cloud servers, create the “DevOps Monitoring Deep Dive” distribution on a small-sized host. It contains a Dockerized Ubuntu (18.04 Bionic Beaver LTS).

    So apt-get commands are used to install Prometheus, Alertmanager, and Grafana.

    docker run --name ft-app -p 80:8080 -d forethought

    The infrastructure is monitored by using Prometheus’s Node Exporter to view statistics about CPU, memory, disk, file system, basic networking, and load metrics. Also monitored are containers being used on virtual machines.

    Once infrastructure monitoring is up and running, the basic Node.js application uses a Prometheus client library to track metrics across the app.

    Finally, add recording and alerting rules, and build out a series of routes so any alerts created get to their desired endpoints.

    The course also looks at creating persistent dashboards with Grafana and use its various graphing options to better track data.

Other notes

https://timber.io/blog/prometheus-the-good-the-bad-and-the-ugly/

https://eng.uber.com/m3/ Uber open-sourced their M3 metrics platform for Prometheus in 2018. Some Uber employees went on to found Chronosphere, which offers a hosted Prometheus-compatible service.

  • https://chronosphere.io/resource/explainer-video-chronosphere-overview/
  • https://www.youtube.com/watch?v=jIrru50I-WY

https://www.youtube.com/watch?v=h4Sl21AKiDg&pp=ygUKcHJvbWV0aGV1cw%3D%3D How Prometheus Monitoring works | Prometheus Architecture explained by TechWorld with Nana

https://github.com/akeylesslabs/helm-charts/blob/main/monitoring/akeyless-zero-trust-web-access/grafana-dashboard.yaml

https://financialit.net/news/apis/how-stash-keeping-its-platform-secure-amid-drive-integration

https://mattermost.com/blog/sloth-for-slo-monitoring-and-alerting-with-prometheus/ describes SLO monitoring with https://github.com/slok/sloth using Prometheus and Grafana; a Thanos Ruler evaluates the rules and sends alerts to Alertmanager, which integrates notifications with the Mattermost Community Server and OpsGenie.

An example of what Metrics documentation about its scrape_config looks like: https://docs-git-update-metrics-gatewaydio.vercel.app/using-gatewayd/global-configuration/metrics/

In the expression browser:

  • The Graph view graphs a PromQL expression over time.
  • The Table view provides a view of the output of a PromQL expression at one point in time.


/tsdb_status

The TSDB Status page shows detailed statistics about individual metric names and labels:

  • Top 10 label names with value count - if some unintentional high-cardinality data (like public user IPs or full HTTP paths) is stored in label values, you will immediately see those problematic labels at the top here.

  • Top 10 label names with high memory usage - helps identify label names that may not have an especially high number of values associated with them, but whose individual values are overly large and memory-consuming.

  • Top 10 series count by metric names - to track down metric names that overall have a combination of labels attached to them that cause a large number of time series.

  • Top 10 series count by label value pairs - shows when a specific label/value pair, such as a histogram metric name (__name__="my_histogram_bucket"), has a large number of bucket time series attached to it.
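
The same cardinality questions can also be asked ad-hoc in PromQL; a sketch (note that this scans every series, so it can be expensive on a large server):

    # The ten metric names with the most time series:
    topk(10, count by(__name__) ({__name__=~".+"}))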

Resources

https://developer.hashicorp.com/terraform/tutorials/enterprise/tfe-metrics

TechnoTim explains use of Rancher Monitoring from Helm charts in a Kubernetes cluster.

https://www.youtube.com/watch?v=TyBsKMTDl1Q Prometheus, Alert Manager, Email Notification & Grafana in Kubernetes Monitoring | Merciboi

https://www.youtube.com/watch?v=wFMEoWpVOYw Monitoring Docker Containers using Grafana & Prometheus

https://training.linuxfoundation.org/training/monitoring-systems-and-services-with-prometheus-lfs241/ $299 for 25 hours with labs: Monitoring Systems and Services with Prometheus (LFS241)

Thanos and Promscale can de-duplicate metrics from replicated Prometheus servers.


More on Security

This is one of a series on Security and DevSecOps:

  1. Security actions for teamwork and SLSA
  2. DevSecOps

  3. Code Signing on macOS
  4. Transport Layer Security

  5. Git Signing
  6. GitHub Data Security
  7. Encrypt all the things

  8. Azure Security-focus Cloud Onramp
  9. Azure Networking

  10. AWS Onboarding
  11. AWS Security (certification exam)
  12. AWS IAM (Identity and Access Management)
  13. AWS Networking

  14. SIEM (Security Information and Event Management)
  15. Intrusion Detection Systems (Google/Palo Alto)
  16. Chaos Engineering

  17. SOC2
  18. FedRAMP
  19. CAIQ (Consensus Assessment Initiative Questionnaire) by cloud vendors

  20. AKeyless cloud vault
  21. Hashicorp Vault
  22. Hashicorp Terraform
  23. OPA (Open Policy Agent)

  24. SonarQube
  25. WebGoat known insecure PHP app and vulnerability scanners
  26. Test for OWASP using ZAP on the Broken Web App

  27. Security certifications
  28. Details about Cyber Security

  29. Quantum Supremacy can break encryption in minutes
  30. Pen Testing
  31. Kali Linux

  32. Threat Modeling
  33. WebGoat (deliberately insecure Java app)