TL;DR — Quick Summary
Set up Prometheus with Node Exporter for Linux server monitoring. Learn PromQL, alerting rules, Alertmanager, storage tuning, and security hardening.
Prometheus is the de-facto standard for open-source infrastructure monitoring. Combined with Node Exporter, it gives you deep visibility into every Linux server in your fleet — CPU usage, memory pressure, disk saturation, network throughput, and load averages — all queryable with a powerful expression language and alertable through Alertmanager. This guide walks from bare metal to a production-ready monitoring stack: architecture, installation, PromQL, alerting, Alertmanager, recording rules, storage tuning, and security hardening.
Prerequisites
- Ubuntu 22.04 or Debian 12 (steps are identical on RHEL/Rocky with dnf substituted for apt)
- Root or sudo access
- Ports 9090 (Prometheus), 9100 (Node Exporter), and 9093 (Alertmanager) open in your firewall
- Basic familiarity with systemd and YAML
Architecture Overview
Prometheus uses a pull model: it scrapes HTTP endpoints called targets at a configurable interval and stores the samples locally in its own embedded time-series database (TSDB). This is the opposite of push-based systems like StatsD.
┌───────────────┐  scrape   ┌──────────────────┐  query  ┌─────────┐
│ Node Exporter │ ←──────── │    Prometheus    │ ←────── │ Grafana │
│     :9100     │           │  :9090  (TSDB)   │         └─────────┘
└───────┬───────┘           └────────┬─────────┘
        │ reads                      │ firing alerts
    /proc /sys                       ↓
                            ┌──────────────────┐
                            │   Alertmanager   │
                            │      :9093       │
                            │ (email/Slack/PD) │
                            └──────────────────┘
Key concepts:
- Scrape — Prometheus fetches /metrics from each target at scrape_interval
- TSDB — Local time-series database; each sample is a float64 value with a millisecond timestamp and a set of labels
- PromQL — Query language for slicing, aggregating, and computing rates over time-series
- Alertmanager — Separate process that receives firing alerts from Prometheus and routes notifications
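A scrape is nothing exotic: the target just serves plain text in the Prometheus exposition format, one sample per line. A hand-written sample (illustrative values) of what Prometheus sees on each scrape:

```shell
# A representative /metrics payload: `# HELP`/`# TYPE` metadata lines,
# then `name{labels} value` samples — exactly what a scrape returns.
METRICS=$(cat <<'EOF'
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.82
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
EOF
)
echo "$METRICS"
```

Because the format is this simple, anything that can serve HTTP text can become a target — which is why exporters exist for nearly every piece of infrastructure.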
Step 1: Install Prometheus
Create the system user and directories
sudo useradd --system --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus
Download and install the binary
# Check https://github.com/prometheus/prometheus/releases for the latest version
PROM_VERSION="2.51.2"
cd /tmp
wget "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz"
tar xzf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROM_VERSION}.linux-amd64
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
sudo cp -r consoles console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/consoles /etc/prometheus/console_libraries
Write prometheus.yml
sudo nano /etc/prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
environment: "production"
datacenter: "dc1"
rule_files:
- /etc/prometheus/rules/*.yml
alerting:
alertmanagers:
- static_configs:
- targets: ["localhost:9093"]
scrape_configs:
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
- job_name: "node"
static_configs:
- targets: ["localhost:9100"]
labels:
instance: "web-01"
environment: "production"
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
# Validate syntax before starting
promtool check config /etc/prometheus/prometheus.yml
Create the systemd unit
sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/introduction/overview/
After=network.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=30d \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=0.0.0.0:9090 \
--web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5s
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/var/lib/prometheus
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus
Step 2: Install Node Exporter
Node Exporter exposes hardware and OS metrics from /proc and /sys on port 9100.
NODE_VERSION="1.7.0"
cd /tmp
wget "https://github.com/prometheus/node_exporter/releases/download/v${NODE_VERSION}/node_exporter-${NODE_VERSION}.linux-amd64.tar.gz"
tar xzf node_exporter-${NODE_VERSION}.linux-amd64.tar.gz
sudo useradd --system --no-create-home --shell /bin/false node_exporter
sudo cp node_exporter-${NODE_VERSION}.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target
[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes
Restart=on-failure
RestartSec=5s
NoNewPrivileges=true
ProtectSystem=strict
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
# Verify metrics are exposed
curl -s http://localhost:9100/metrics | head -30
Key Node Exporter Metrics
| Metric | Description |
|---|---|
| node_cpu_seconds_total | CPU time by mode (user, system, idle, iowait) — use rate() |
| node_memory_MemAvailable_bytes | Available memory (includes reclaimable cache) |
| node_memory_MemTotal_bytes | Total physical memory |
| node_filesystem_avail_bytes | Free disk space on a filesystem |
| node_filesystem_size_bytes | Total filesystem size |
| node_network_receive_bytes_total | Bytes received per network interface — use rate() |
| node_network_transmit_bytes_total | Bytes transmitted per network interface — use rate() |
| node_load1 / node_load5 / node_load15 | System load averages |
| node_disk_io_time_seconds_total | Time spent doing I/O — use rate() for saturation |
| node_time_seconds | Current system time (for clock drift detection) |
| node_systemd_unit_state | State of systemd units when --collector.systemd enabled |
PromQL Basics
PromQL is Prometheus’s query language. Open the Prometheus UI at http://your-server:9090 to experiment.
CPU usage percentage
# CPU usage across all cores, averaged
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Disk usage percentage
(1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"} / node_filesystem_size_bytes)) * 100
Network throughput (bytes/sec)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
Key PromQL functions
| Function | Example | Purpose |
|---|---|---|
| rate() | rate(http_requests_total[5m]) | Per-second average rate over a range vector |
| irate() | irate(cpu_seconds_total[5m]) | Instantaneous rate (last two samples) — spiky |
| increase() | increase(errors_total[1h]) | Total increase over a range (≈ rate × duration) |
| histogram_quantile() | histogram_quantile(0.99, rate(http_duration_bucket[5m])) | P99 latency from a histogram metric |
| avg by() | avg by (job) (metric) | Average across a dimension |
| sum without() | sum without (cpu) (metric) | Aggregate dropping specified labels |
| topk() | topk(5, metric) | Top N time-series by value |
| predict_linear() | predict_linear(disk_avail[6h], 3600*24) | Predicted value N seconds from now |
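Beyond the UI, every expression can be evaluated over the HTTP API, which is handy for scripts and smoke tests. A sketch, assuming Prometheus is reachable on localhost:9090:

```shell
# Evaluate a PromQL expression via the query API; --data-urlencode handles
# the braces and quotes that would otherwise need manual escaping.
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100' \
  | python3 -m json.tool
```

The response is JSON with a `data.result` array, one entry per matching time-series — convenient to feed into jq or a cron-driven check.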
Alerting Rules
Create the rules directory and an alerts file:
sudo mkdir -p /etc/prometheus/rules
sudo nano /etc/prometheus/rules/alerts.yml
groups:
- name: node_alerts
interval: 30s
rules:
- alert: HighCPUUsage
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | printf \"%.1f\" }}% for more than 5 minutes."
- alert: CriticalCPUUsage
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
for: 2m
labels:
severity: critical
annotations:
summary: "Critical CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | printf \"%.1f\" }}% — investigate immediately."
- alert: LowDiskSpace
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"} / node_filesystem_size_bytes) * 100 < 15
for: 5m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Filesystem {{ $labels.mountpoint }} has {{ $value | printf \"%.1f\" }}% free."
- alert: CriticalDiskSpace
expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"} / node_filesystem_size_bytes) * 100 < 5
for: 1m
labels:
severity: critical
annotations:
summary: "Critical disk space on {{ $labels.instance }}"
description: "Filesystem {{ $labels.mountpoint }} is nearly full."
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | printf \"%.1f\" }}%."
- alert: InstanceDown
expr: up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "{{ $labels.job }}/{{ $labels.instance }} has been unreachable for 1 minute."
- alert: HighLoad
expr: node_load1 / count without(cpu, mode)(node_cpu_seconds_total{mode="idle"}) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "High system load on {{ $labels.instance }}"
description: "Load average per CPU is {{ $value | printf \"%.2f\" }}."
sudo chown -R prometheus:prometheus /etc/prometheus/rules
promtool check rules /etc/prometheus/rules/alerts.yml
# Reload without restarting (works because --web.enable-lifecycle is set)
curl -X POST http://localhost:9090/-/reload
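Beyond syntax checking, promtool can unit-test rules offline against synthetic series. A minimal, self-contained sketch (files under /tmp for illustration; in practice point rule_files at your real alerts.yml):

```shell
# A rule + test pair. The input series advances idle CPU time by 6s per
# minute, i.e. the CPU is ~90% busy, so HighCPUUsage should fire.
cat > /tmp/demo_rules.yml <<'EOF'
groups:
  - name: demo
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
EOF
cat > /tmp/demo_tests.yml <<'EOF'
rule_files:
  - /tmp/demo_rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'node_cpu_seconds_total{mode="idle",instance="web-01",cpu="0"}'
        values: '0+6x20'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighCPUUsage
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: web-01
EOF
# Run the test only where promtool is installed
if command -v promtool >/dev/null; then promtool test rules /tmp/demo_tests.yml; fi
```

Unit tests like this catch inverted comparisons and label mismatches before a bad rule ships — far cheaper than discovering the problem during an outage.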
Alertmanager Setup
Install Alertmanager
AM_VERSION="0.27.0"
cd /tmp
wget "https://github.com/prometheus/alertmanager/releases/download/v${AM_VERSION}/alertmanager-${AM_VERSION}.linux-amd64.tar.gz"
tar xzf alertmanager-${AM_VERSION}.linux-amd64.tar.gz
sudo useradd --system --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo cp alertmanager-${AM_VERSION}.linux-amd64/alertmanager /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /etc/alertmanager /var/lib/alertmanager
Configure routes
sudo nano /etc/alertmanager/alertmanager.yml
global:
smtp_smarthost: "smtp.example.com:587"
smtp_from: "alerts@example.com"
smtp_auth_username: "alerts@example.com"
smtp_auth_password: "your-smtp-password"
slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
route:
group_by: ["alertname", "instance"]
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: "ops-team"
routes:
- match:
severity: critical
receiver: "pagerduty"
continue: true
- match:
severity: warning
receiver: "slack-warnings"
receivers:
- name: "ops-team"
email_configs:
- to: "ops@example.com"
send_resolved: true
- name: "slack-warnings"
slack_configs:
- channel: "#alerts"
title: "{{ .GroupLabels.alertname }}"
text: "{{ range .Alerts }}{{ .Annotations.description }}{{ end }}"
send_resolved: true
- name: "pagerduty"
pagerduty_configs:
- routing_key: "YOUR_PAGERDUTY_INTEGRATION_KEY"
send_resolved: true
inhibit_rules:
- source_match:
severity: critical
target_match:
severity: warning
equal: ["alertname", "instance"]
sudo nano /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alertmanager
After=network.target
[Service]
Type=simple
User=alertmanager
Group=alertmanager
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager
Restart=on-failure
NoNewPrivileges=true
ProtectSystem=strict
ReadWritePaths=/var/lib/alertmanager
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
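Before relying on real alerts, you can confirm routing end to end by POSTing a synthetic alert to Alertmanager's v2 API. A sketch, assuming Alertmanager on localhost:9093 (the alert auto-resolves once endsAt passes):

```shell
# Fire a fake warning-severity alert; it should appear in the Alertmanager
# UI and be routed to the warning receiver after group_wait elapses.
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","instance":"web-01"},
        "annotations":{"description":"Synthetic alert to verify routing"},
        "endsAt":"'"$(date -u -d '+5 minutes' +%Y-%m-%dT%H:%M:%SZ)"'"}]'
```

Testing the notification path this way is much faster than waiting for a real threshold breach, and it exercises the SMTP/Slack credentials too.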
Recording Rules
Recording rules pre-compute expensive PromQL expressions and store the result as a new metric. This dramatically speeds up dashboards that aggregate across hundreds of nodes.
# /etc/prometheus/rules/recording.yml
groups:
- name: node_recording
interval: 1m
rules:
- record: instance:node_cpu_usage:percent
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- record: instance:node_memory_usage:percent
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
- record: instance:node_filesystem_usage:percent
expr: (1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|devtmpfs"} / node_filesystem_size_bytes)) * 100
- record: instance:node_network_receive_bytes:rate5m
expr: sum by (instance) (rate(node_network_receive_bytes_total{device!="lo"}[5m]))
After writing this file, run promtool check rules /etc/prometheus/rules/recording.yml and reload Prometheus.
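Once the recorded series exist, alert expressions can reference them instead of repeating the raw query — a sketch mirroring the rule names above (fragment of a rules file, not a complete one):

```yaml
# In an alerting rules group — same CPU alert, cheaper to evaluate
- alert: HighCPUUsage
  expr: instance:node_cpu_usage:percent > 85
  for: 5m
  labels:
    severity: warning
```

This keeps the expensive rate()/avg() work in one place and makes the alert condition trivially readable.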
Storage Tuning
Prometheus TSDB compresses time-series data efficiently — expect about 1-2 bytes per sample. A server scraping 1,000 series every 15 seconds ingests roughly 67 samples per second, which works out to only a few hundred megabytes per month.
# In prometheus.service ExecStart flags:
# Keep data for 90 days
--storage.tsdb.retention.time=90d
# OR cap storage size (Prometheus will delete oldest blocks first)
--storage.tsdb.retention.size=20GB
# Write-ahead-log compression (enabled by default in recent releases)
--storage.tsdb.wal-compression
# Head block chunk range (default 2h) — rarely worth changing in production
--storage.tsdb.min-block-duration=2h
Check current storage usage:
du -sh /var/lib/prometheus/
# Or query the TSDB status API for head-series counts and chunk statistics
curl -s http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool
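The back-of-envelope sizing math is: samples per second = series ÷ scrape interval, times bytes per sample, times retention. A quick sketch for the 1,000-series example (1.5 bytes/sample is a mid-range assumption; real compression varies with data shape):

```shell
# Estimate TSDB disk usage in MB for a given series count, scrape
# interval (seconds), and retention (days), at ~1.5 bytes per sample.
estimate_mb() {
  awk -v series="$1" -v interval="$2" -v days="$3" \
    'BEGIN { printf "%.0f\n", series/interval * 86400*days * 1.5 / 1048576 }'
}
estimate_mb 1000 15 30   # 1,000 series at 15s for 30 days
```

Running the numbers before ordering disks beats discovering a full /var/lib/prometheus in production.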
Security
Basic authentication (Prometheus 2.24+)
# Generate a bcrypt hash (htpasswd is in apache2-utils on Debian/Ubuntu,
# httpd-tools on RHEL)
htpasswd -nBC 10 admin
# Copy the hash portion of the output
sudo nano /etc/prometheus/web.yml
basic_auth_users:
admin: "$2y$10$your-bcrypt-hash-here"
# Add to ExecStart in prometheus.service
--web.config.file=/etc/prometheus/web.yml
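After restarting Prometheus, verify that unauthenticated requests are rejected and credentialed ones succeed (a sketch, assuming the admin user created above and a running instance on port 9090):

```shell
# Expect 401 without credentials, 200 with them
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9090/-/healthy
curl -s -o /dev/null -w '%{http_code}\n' -u admin:your-password http://localhost:9090/-/healthy
```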
TLS termination with nginx reverse proxy
server {
listen 443 ssl;
server_name prometheus.example.com;
ssl_certificate /etc/letsencrypt/live/prometheus.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/prometheus.example.com/privkey.pem;
auth_basic "Prometheus";
auth_basic_user_file /etc/nginx/.prometheus-htpasswd;
location / {
proxy_pass http://127.0.0.1:9090;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
Bind Prometheus to localhost only when using a reverse proxy:
--web.listen-address=127.0.0.1:9090
Firewall rules
# Block direct access — only nginx exposed externally
sudo ufw deny 9090
sudo ufw deny 9093
# Node Exporter: allow only from Prometheus server IP
sudo ufw allow from 10.0.0.5 to any port 9100
Comparison Table
| Feature | Prometheus | Zabbix | Nagios | Datadog | Netdata |
|---|---|---|---|---|---|
| Model | Pull (scrape) | Push/Pull | Pull (active checks) | Push (agent) | Push (parent-child) |
| Storage | Local TSDB | PostgreSQL/MySQL | Flat files | SaaS cloud | DB / Netdata Cloud |
| Query language | PromQL (powerful) | Custom + SQL | None | DogStatsD metrics | Custom |
| Alerting | Alertmanager | Built-in | Built-in | Built-in | Built-in |
| Dashboards | Grafana (external) | Built-in | None native | Built-in | Built-in |
| Auto-discovery | Yes (file/SD) | Yes | Limited | Yes (Agent) | Yes |
| Cost | Free / open-source | Free (enterprise paid) | Free (enterprise paid) | SaaS pricing | Free (cloud paid) |
| Scalability | Thanos/Cortex/Mimir | Good with proxy | Poor (monolith) | Managed | Good |
| Learning curve | Moderate (PromQL) | Steep | Steep | Low | Very low |
| Best for | Cloud-native, Kubernetes | Enterprise IT | Legacy / SNMP | Managed SaaS | Real-time per-host |
Production Stack Recipe
A minimal production setup for a fleet of Linux servers:
Prometheus server (t3.small or 2 vCPUs / 4 GB RAM):
- /var/lib/prometheus → separate data volume (100 GB+)
- Retention: 30d time OR 20 GB size
- Scrapers: node_exporter on every host, blackbox_exporter for HTTP checks
Node Exporter (on every server):
- systemd service, port 9100
- Collectors: --collector.systemd --collector.processes
- Firewall: allow 9100 only from Prometheus server IP
Alertmanager:
- Deduplicated alert routing: Slack for warnings, PagerDuty for critical
- Inhibit rules: suppress warnings when critical fires for same instance
Grafana (optional but recommended):
- Docker: docker run -d -p 3000:3000 grafana/grafana
- Import dashboard ID 1860 (Node Exporter Full) from grafana.com/dashboards
Reverse proxy (nginx + TLS):
- Prometheus: https://prometheus.internal.example.com (VPN-only)
- Grafana: https://grafana.example.com (public, auth required)
Gotchas and Edge Cases
rate() requires at least two samples. If a counter resets (process restart), rate() handles it gracefully. irate() is more sensitive to resets — use rate() for alerting rules.
Cardinality explosion kills performance. Never use high-cardinality labels like user_id, session_id, or request_url in metrics. Each unique label combination creates a separate time-series. A few thousand extra series slow queries; millions crash Prometheus.
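The multiplication is what bites: the series count for a metric is the product of each label's distinct values. A sketch with illustrative fleet numbers:

```shell
# Series count = product of per-label cardinalities, per metric name.
# 500 hosts x 64 CPUs x 8 modes is already 256,000 series for one metric.
series_count() {
  awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN { print a*b*c }'
}
series_count 500 64 8      # node_cpu_seconds_total fleet-wide: fine
series_count 500 64 10000  # add a user_id-style label: 320 million — fatal
```

On a live server, topk(10, count by (__name__)({__name__=~".+"})) in the Prometheus UI shows which metric names own the most series.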
for clause in alerts prevents flapping. An alert with for: 5m must be continuously firing for 5 minutes before it sends a notification. Omit for and a 1-second spike triggers a page.
Recording rules need time to populate. After adding a recording rule, the new metric only exists from that point forward — you cannot query it for historical data that predates the rule.
Scrape timeout must be less than scrape interval. If scrape_timeout (default 10s) exceeds scrape_interval, Prometheus logs errors. If your target is slow, increase scrape_interval for that job.
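Both settings can be overridden per job in prometheus.yml — a fragment for a slow exporter (job name and port illustrative):

```yaml
scrape_configs:
  - job_name: "slow-exporter"
    scrape_interval: 60s   # must be >= scrape_timeout
    scrape_timeout: 30s
    static_configs:
      - targets: ["localhost:9099"]
```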
The systemd collector is disabled by default — on every distro, not just Ubuntu 22.04. Without --collector.systemd in the ExecStart, node_systemd_unit_state metrics are missing entirely.
Summary
- Prometheus uses a pull model — it scrapes /metrics endpoints at a configurable interval
- Node Exporter exposes hardware and OS metrics from /proc and /sys on port 9100
- PromQL functions rate(), increase(), and histogram_quantile() are the core of useful queries
- Write alerting rules with a for clause to prevent flapping; route through Alertmanager for deduplication and silencing
- Recording rules pre-compute expensive aggregations — critical for multi-node dashboards
- Tune storage retention with --storage.tsdb.retention.time or --storage.tsdb.retention.size
- Always put Prometheus behind a reverse proxy with TLS and restrict direct port access via firewall
- Avoid high-cardinality labels — they are the most common cause of performance problems