TL;DR — Quick Summary

Set up Prometheus with Node Exporter for Linux server monitoring. Learn PromQL, alerting rules, Alertmanager, storage tuning, and security hardening.

Prometheus is the de facto standard for open-source infrastructure monitoring. Combined with Node Exporter, it gives you deep visibility into every Linux server in your fleet — CPU usage, memory pressure, disk saturation, network throughput, and load averages — all queryable with a powerful expression language and alertable through Alertmanager. This guide walks you from bare metal to a production-ready monitoring stack: architecture, installation, PromQL, alerting, Alertmanager, recording rules, storage tuning, and security hardening.

Prerequisites

  • Ubuntu 22.04 or Debian 12 (steps are identical on RHEL/Rocky with dnf substituted for apt)
  • Root or sudo access
  • Ports 9090 (Prometheus), 9100 (Node Exporter), and 9093 (Alertmanager) open in your firewall
  • Basic familiarity with systemd and YAML

Architecture Overview

Prometheus uses a pull model: it scrapes HTTP endpoints called targets at a configurable interval and stores the resulting time-series data locally in its own purpose-built time-series database (TSDB). This is the opposite of push-based systems like StatsD.

┌─────────────────────────────────────────────────────────┐
│                     Linux Server                         │
│                                                          │
│   Node Exporter :9100  ←  Prometheus :9090  →  Grafana  │
│         ↑                      ↓                         │
│   /proc /sys              TSDB on disk                   │
│                                ↓                         │
│                          Alertmanager :9093              │
│                          (email/Slack/PD)                │
└─────────────────────────────────────────────────────────┘

Key concepts:

  • Scrape — Prometheus fetches /metrics from each target at scrape_interval
  • TSDB — Local time-series database; each sample is a float64 value with a millisecond-precision timestamp, attached to a metric name and a set of labels
  • PromQL — Query language for slicing, aggregating, and computing rates over time-series
  • Alertmanager — Separate process that receives firing alerts from Prometheus and routes notifications
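Every target serves plain text in the Prometheus exposition format: one sample per line, labels in braces, the float value at the end. A Node Exporter scrape returns lines like these (values illustrative):

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 184503.27
node_cpu_seconds_total{cpu="0",mode="user"} 912.44
node_load1 0.35
```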

Step 1: Install Prometheus

Create the system user and directories

sudo useradd --system --no-create-home --shell /bin/false prometheus

sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus

Download and install the binary

# Check https://github.com/prometheus/prometheus/releases for the latest version
PROM_VERSION="2.51.2"
cd /tmp
wget "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz"
tar xzf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROM_VERSION}.linux-amd64

sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

sudo cp -r consoles console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/consoles /etc/prometheus/console_libraries

Write prometheus.yml

sudo nano /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    environment: "production"
    datacenter: "dc1"

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
        labels:
          instance: "web-01"
          environment: "production"
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
# Validate syntax before starting
promtool check config /etc/prometheus/prometheus.yml
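Hand-editing static_configs scales poorly as hosts are added. As a sketch, file-based service discovery lets Prometheus read targets from files on disk and pick up changes automatically; the job name, paths, and labels below are placeholders:

```yaml
# Hypothetical extra entry under scrape_configs in prometheus.yml
  - job_name: "node-fleet"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml
        refresh_interval: 1m
```

```yaml
# /etc/prometheus/targets/web.yml (one targets file per group of hosts)
- targets: ["10.0.0.11:9100", "10.0.0.12:9100"]
  labels:
    environment: "production"
```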

Create the systemd unit

sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle

Restart=on-failure
RestartSec=5s
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/var/lib/prometheus

[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus

Step 2: Install Node Exporter

Node Exporter exposes hardware and OS metrics from /proc and /sys on port 9100.

NODE_VERSION="1.7.0"
cd /tmp
wget "https://github.com/prometheus/node_exporter/releases/download/v${NODE_VERSION}/node_exporter-${NODE_VERSION}.linux-amd64.tar.gz"
tar xzf node_exporter-${NODE_VERSION}.linux-amd64.tar.gz

sudo useradd --system --no-create-home --shell /bin/false node_exporter
sudo cp node_exporter-${NODE_VERSION}.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target

[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes

Restart=on-failure
RestartSec=5s
NoNewPrivileges=true
ProtectSystem=strict

[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# Verify metrics are exposed
curl -s http://localhost:9100/metrics | head -30

Key Node Exporter Metrics

| Metric | Description |
|--------|-------------|
| node_cpu_seconds_total | CPU time by mode (user, system, idle, iowait) — use rate() |
| node_memory_MemAvailable_bytes | Available memory (includes reclaimable cache) |
| node_memory_MemTotal_bytes | Total physical memory |
| node_filesystem_avail_bytes | Free disk space on a filesystem |
| node_filesystem_size_bytes | Total filesystem size |
| node_network_receive_bytes_total | Bytes received per network interface — use rate() |
| node_network_transmit_bytes_total | Bytes transmitted per network interface — use rate() |
| node_load1 / node_load5 / node_load15 | System load averages |
| node_disk_io_time_seconds_total | Time spent doing I/O — use rate() for saturation |
| node_time_seconds | Current system time (for clock drift detection) |
| node_systemd_unit_state | State of systemd units when --collector.systemd enabled |

PromQL Basics

PromQL is Prometheus’s query language. Open the Prometheus UI at http://your-server:9090 to experiment.

CPU usage percentage

# CPU usage across all cores, averaged
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
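To see the arithmetic behind this expression, here is a hand-worked sketch of what rate() computes, using two hypothetical samples of the idle counter for a single core taken 30 seconds apart:

```shell
#!/bin/sh
# rate() divides the counter increase by the elapsed time. The idle counter
# went from 1000.0s to 1024.0s over a 30s window: idle 80% of the time,
# so the core was busy 100 - 80 = 20%.
idle_t0=1000.0
idle_t1=1024.0
window=30

awk -v a="$idle_t0" -v b="$idle_t1" -v w="$window" \
  'BEGIN { printf "CPU busy: %.1f%%\n", 100 - ((b - a) / w) * 100 }'
# Prints: CPU busy: 20.0%
```

Over a real [5m] window Prometheus performs the same division using all samples in the range, and avg by (instance) then averages the per-core results.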

Memory usage percentage

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Disk usage percentage

(1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"} / node_filesystem_size_bytes)) * 100

Network throughput (bytes/sec)

rate(node_network_receive_bytes_total{device!="lo"}[5m])

Key PromQL functions

| Function | Example | Purpose |
|----------|---------|---------|
| rate() | rate(http_requests_total[5m]) | Per-second average rate over a range vector |
| irate() | irate(cpu_seconds_total[5m]) | Instantaneous rate (last two samples) — spiky |
| increase() | increase(errors_total[1h]) | Total increase over a range (= rate × duration) |
| histogram_quantile() | histogram_quantile(0.99, rate(http_duration_bucket[5m])) | P99 latency from a histogram metric |
| avg by() | avg by (job) (metric) | Average across a dimension |
| sum without() | sum without (cpu) (metric) | Aggregate dropping specified labels |
| topk() | topk(5, metric) | Top N time-series by value |
| predict_linear() | predict_linear(disk_avail[6h], 3600*24) | Predict value in N seconds |
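predict_linear() fits a least-squares line over the samples in the range and extrapolates it forward. The core idea can be sketched with two hypothetical disk-free readings (a real query uses every sample in the window, not just the endpoints):

```shell
#!/bin/sh
# Free space fell from 80 GiB to 74 GiB over 6 hours: slope = -1 GiB/hour.
# Extrapolating 24 hours ahead: 74 + (-1 * 24) = 50 GiB predicted free.
awk -v start=80 -v end=74 -v window_h=6 -v ahead_h=24 \
  'BEGIN {
     slope = (end - start) / window_h
     printf "predicted free in 24h: %.0f GiB\n", end + slope * ahead_h
   }'
# Prints: predicted free in 24h: 50 GiB
```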

Alerting Rules

Create the rules directory and an alerts file:

sudo mkdir -p /etc/prometheus/rules
sudo nano /etc/prometheus/rules/alerts.yml
groups:
  - name: node_alerts
    interval: 30s
    rules:

      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}% for more than 5 minutes."

      - alert: CriticalCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}% — investigate immediately."

      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"} / node_filesystem_size_bytes) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free."

      - alert: CriticalDiskSpace
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"} / node_filesystem_size_bytes) * 100 < 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} is nearly full."

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | printf \"%.1f\" }}%."

      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }}/{{ $labels.instance }} has been unreachable for 1 minute."

      - alert: HighLoad
        expr: node_load1 / count without(cpu, mode)(node_cpu_seconds_total{mode="idle"}) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High system load on {{ $labels.instance }}"
          description: "Load average per CPU is {{ $value | printf \"%.2f\" }}."
sudo chown -R prometheus:prometheus /etc/prometheus/rules
promtool check rules /etc/prometheus/rules/alerts.yml
# The unit above defines no ExecReload; use the lifecycle endpoint enabled by --web.enable-lifecycle
curl -X POST http://localhost:9090/-/reload
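Beyond syntax checking, promtool can also unit-test alerting rules against synthetic series before they ever fire in production. A sketch for the InstanceDown rule; the file path and series values are illustrative:

```yaml
# /etc/prometheus/rules/alerts_test.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="web-01"}'
        values: "0 0 0 0 0 0"
    alert_rule_test:
      - eval_time: 5m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: web-01
            exp_annotations:
              summary: "Instance web-01 is down"
              description: "node/web-01 has been unreachable for 1 minute."
```

Run it with promtool test rules /etc/prometheus/rules/alerts_test.yml; it should report SUCCESS when the expected alert fires.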

Alertmanager Setup

Install Alertmanager

AM_VERSION="0.27.0"
cd /tmp
wget "https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz"
tar xzf alertmanager-${AM_VERSION}.linux-amd64.tar.gz

sudo useradd --system --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo cp alertmanager-${AM_VERSION}.linux-amd64/alertmanager /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /etc/alertmanager /var/lib/alertmanager

Configure routes

sudo nano /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alerts@example.com"
  smtp_auth_username: "alerts@example.com"
  smtp_auth_password: "your-smtp-password"
  slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

route:
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "ops-team"
  routes:
    - matchers:
        - severity = "critical"
      receiver: "pagerduty"
      continue: true
    - matchers:
        - severity = "warning"
      receiver: "slack-warnings"

receivers:
  - name: "ops-team"
    email_configs:
      - to: "ops@example.com"
        send_resolved: true

  - name: "slack-warnings"
    slack_configs:
      - channel: "#alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
        send_resolved: true

  - name: "pagerduty"
    pagerduty_configs:
      - routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
        send_resolved: true

inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname", "instance"]
sudo nano /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alertmanager
After=network.target

[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager

Restart=on-failure
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/var/lib/alertmanager

[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager

Recording Rules

Recording rules pre-compute expensive PromQL expressions and store the result as a new metric. This dramatically speeds up dashboards that aggregate across hundreds of nodes.

# /etc/prometheus/rules/recording.yml
groups:
  - name: node_recording
    interval: 1m
    rules:
      - record: instance:node_cpu_usage:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      - record: instance:node_memory_usage:percent
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

      - record: instance:node_filesystem_usage:percent
        expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"} / node_filesystem_size_bytes)) * 100

      - record: instance:node_network_receive_bytes:rate5m
        expr: sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[5m]))

After writing this file, run promtool check rules /etc/prometheus/rules/recording.yml and reload Prometheus.


Storage Tuning

Prometheus TSDB compresses time-series data efficiently — expect about 1-2 bytes per sample on disk. A server scraping 1,000 series every 15 seconds ingests about 67 samples per second, which works out to roughly 300 MB per month.
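That estimate is easy to script. A sketch assuming 1,000 series scraped every 15 seconds at 1.7 bytes per sample (the bytes-per-sample figure is an assumption; real compression varies with data shape):

```shell
#!/bin/sh
# Rough TSDB sizing: samples/sec x bytes/sample x retention in seconds.
series=1000
scrape_interval=15       # seconds
bytes_per_sample=1.7     # assumed; typically 1-2 bytes after compression
retention_days=30

awk -v s="$series" -v i="$scrape_interval" -v b="$bytes_per_sample" -v d="$retention_days" \
  'BEGIN {
     samples_per_sec = s / i
     total_bytes = samples_per_sec * b * d * 86400
     printf "~%.0f MB over %d days\n", total_bytes / 1024 / 1024, d
   }'
# Prints roughly "~280 MB over 30 days" with these inputs
```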

# In prometheus.service ExecStart flags:

# Keep data for 90 days
--storage.tsdb.retention.time=90d

# OR cap storage size (Prometheus will delete oldest blocks first)
--storage.tsdb.retention.size=20GB

# Compress the write-ahead log (enabled by default since v2.20)
--storage.tsdb.wal-compression

# Head block duration before compaction to disk (default 2h; rarely needs changing)
--storage.tsdb.min-block-duration=2h

Check current storage usage:

du -sh /var/lib/prometheus/
# View TSDB stats in the UI
curl http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool

Security

Basic authentication (Prometheus 2.24+)

# Generate a bcrypt hash (htpasswd is in the apache2-utils package)
htpasswd -nBC 10 admin
# Copy the hash output

sudo nano /etc/prometheus/web.yml
basic_auth_users:
  admin: "$2y$10$your-bcrypt-hash-here"
# Add to ExecStart in prometheus.service
--web.config.file=/etc/prometheus/web.yml

TLS termination with nginx reverse proxy

server {
    listen 443 ssl;
    server_name prometheus.example.com;

    ssl_certificate     /etc/letsencrypt/live/prometheus.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/prometheus.example.com/privkey.pem;

    auth_basic "Prometheus";
    auth_basic_user_file /etc/nginx/.prometheus-htpasswd;

    location / {
        proxy_pass http://127.0.0.1:9090;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Bind Prometheus to localhost only when using a reverse proxy:

--web.listen-address=127.0.0.1:9090

Firewall rules

# Block direct access — only nginx exposed externally
sudo ufw deny 9090
sudo ufw deny 9093
# Node Exporter: allow only from Prometheus server IP
sudo ufw allow from 10.0.0.5 to any port 9100

Comparison Table

| Feature | Prometheus | Zabbix | Nagios | Datadog | Netdata |
|---------|------------|--------|--------|---------|---------|
| Model | Pull (scrape) | Push/Pull | Pull (active checks) | Push (agent) | Push (parent-child) |
| Storage | Local TSDB | PostgreSQL/MySQL | Flat files | SaaS cloud | DB / Netdata Cloud |
| Query language | PromQL (powerful) | Custom + SQL | None | DogStatsD metrics | Custom |
| Alerting | Alertmanager | Built-in | Built-in | Built-in | Built-in |
| Dashboards | Grafana (external) | Built-in | None native | Built-in | Built-in |
| Auto-discovery | Yes (file/SD) | Yes | Limited | Yes (Agent) | Yes |
| Cost | Free / open-source | Free (enterprise paid) | Free (enterprise paid) | SaaS pricing | Free (cloud paid) |
| Scalability | Thanos/Cortex/Mimir | Good with proxy | Poor (monolith) | Managed | Good |
| Learning curve | Moderate (PromQL) | Steep | Steep | Low | Very low |
| Best for | Cloud-native, Kubernetes | Enterprise IT | Legacy / SNMP | Managed SaaS | Real-time per-host |

Production Stack Recipe

A minimal production setup for a fleet of Linux servers:

Prometheus server (t3.small or 2 vCPUs / 4 GB RAM):
  - /var/lib/prometheus  →  separate data volume (100 GB+)
  - Retention: 30d time  OR  20 GB size
  - Scrapers: node_exporter on every host, blackbox_exporter for HTTP checks

Node Exporter (on every server):
  - systemd service, port 9100
  - Collectors: --collector.systemd --collector.processes
  - Firewall: allow 9100 only from Prometheus server IP

Alertmanager:
  - Deduplicated alert routing: Slack for warnings, PagerDuty for critical
  - Inhibit rules: suppress warnings when critical fires for same instance

Grafana (optional but recommended):
  - Docker: docker run -d -p 3000:3000 grafana/grafana
  - Import dashboard ID 1860 (Node Exporter Full) from grafana.com/dashboards

Reverse proxy (nginx + TLS):
  - Prometheus: https://prometheus.internal.example.com  (VPN-only)
  - Grafana:    https://grafana.example.com  (public, auth required)

Gotchas and Edge Cases

rate() requires at least two samples in the range window. Counter resets (e.g. a process restart) are handled automatically by both rate() and irate(), but irate() looks only at the last two samples, so it is spiky; use rate() for alerting rules.

Cardinality explosion kills performance. Never use high-cardinality labels like user_id, session_id, or request_url in metrics. Each unique label combination creates a separate time-series. A few thousand extra series slow queries; millions crash Prometheus.
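The scale of a cardinality blow-up is easy to underestimate. A quick sketch with made-up but plausible numbers:

```shell
#!/bin/sh
# Every unique label combination is a separate time-series.
hosts=100
metrics_per_host=500
echo "baseline series:    $(( hosts * metrics_per_host ))"    # 50000

# Now add a user_id label with 10,000 distinct values to ONE metric:
user_ids=10000
echo "one labeled metric: $(( hosts * user_ids ))"            # 1000000
```

One mislabeled metric creates 20 times more series than the entire baseline fleet.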

for clause in alerts prevents flapping. An alert with for: 5m must evaluate true continuously for 5 minutes before it sends a notification. Omit for, and a single scrape above the threshold triggers a page.

Recording rules need time to populate. After adding a recording rule, the new metric only exists from that point forward — you cannot query it for historical data that predates the rule.

Scrape timeout must not exceed the scrape interval. Prometheus rejects a configuration where scrape_timeout (default 10s) is greater than scrape_interval. If a target is slow, raise scrape_timeout for that job, and raise scrape_interval with it if needed.
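A per-job override handles slow targets; both keys are standard scrape_config fields, and the job name and port here are placeholders:

```yaml
  - job_name: "slow-exporter"
    scrape_interval: 60s
    scrape_timeout: 30s        # must stay <= scrape_interval
    static_configs:
      - targets: ["localhost:9104"]
```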

The systemd collector is disabled by default. On any distribution, not just Ubuntu 22.04, node_systemd_unit_state metrics are missing until you add --collector.systemd explicitly to the systemd ExecStart.


Summary

  • Prometheus uses a pull model — it scrapes /metrics endpoints at a configurable interval
  • Node Exporter exposes hardware and OS metrics from /proc and /sys on port 9100
  • PromQL functions rate(), increase(), and histogram_quantile() are the core of useful queries
  • Write alerting rules with a for clause to prevent flapping; route through Alertmanager for deduplication and silencing
  • Recording rules pre-compute expensive aggregations — critical for multi-node dashboards
  • Tune storage retention with --storage.tsdb.retention.time or --storage.tsdb.retention.size
  • Always put Prometheus behind a reverse proxy with TLS and restrict direct port access via firewall
  • Avoid high-cardinality labels — they are the most common cause of performance problems