Knowing what is happening on your servers is essential, not optional. Whether you manage a single VPS or a fleet of production machines, real-time visibility into CPU usage, memory consumption, disk I/O, and network traffic is what separates proactive infrastructure management from firefighting. Prometheus and Grafana together form one of the most widely adopted open-source monitoring stacks, used for observability at organizations from startups to enterprises.
This guide walks you through the complete process: installing Prometheus and node_exporter to collect system metrics, setting up Grafana for visualization, writing PromQL queries, configuring alerting with Alertmanager, and monitoring Docker containers with cAdvisor. By the end, you will have a fully functional monitoring stack running on Ubuntu.
Prerequisites
Before you begin, make sure you have:
- A server running Ubuntu 22.04 or 24.04 (desktop or server edition)
- Terminal access with sudo privileges
- At least 2 GB of RAM and 20 GB of free disk space
- Ports 9090 (Prometheus), 9100 (node_exporter), 3000 (Grafana), and 9093 (Alertmanager) available
- Basic familiarity with the Linux command line and YAML syntax
- Docker installed (optional, for cAdvisor section) — see our Docker installation guide
Note: All commands in this guide are for Ubuntu on the amd64 architecture. If you are running ARM64, adjust the download URLs to use the linux-arm64 binary variants.
What Are Prometheus and Grafana?
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud in 2012. It collects metrics by scraping HTTP endpoints at regular intervals, stores them in a time-series database (TSDB), and provides a powerful query language called PromQL for analysis. Prometheus follows a pull-based model — it actively fetches metrics from your servers rather than waiting for data to be pushed.
Grafana is an open-source analytics and interactive visualization platform. It connects to data sources like Prometheus, Elasticsearch, InfluxDB, and PostgreSQL to render real-time dashboards with graphs, tables, heatmaps, and alerts. Grafana does not store data itself; it queries the underlying data source on demand.
Together, they form a complete monitoring pipeline:
- Exporters (like node_exporter) expose metrics on HTTP endpoints
- Prometheus scrapes and stores those metrics
- Grafana queries Prometheus and displays the data visually
- Alertmanager handles notifications when metrics cross defined thresholds
Monitoring Architecture Overview
Understanding the data flow is critical before deploying the stack:
┌─────────────────┐     scrape      ┌──────────────┐    query     ┌──────────────┐
│  node_exporter  │────────────────>│  Prometheus  │<─────────────│   Grafana    │
│      :9100      │    /metrics     │    :9090     │              │    :3000     │
└─────────────────┘                 │              │              │              │
                                    │  TSDB        │              │  Dashboards  │
┌─────────────────┐     scrape      │  PromQL      │              │  Panels      │
│    cAdvisor     │────────────────>│  Alert Rules │              │  Alerts      │
│      :8080      │    /metrics     │              │              └──────────────┘
└─────────────────┘                 └──────┬───────┘
                                           │ fire alerts
                                    ┌──────▼───────┐
                                    │ Alertmanager │──> Email / Slack / PagerDuty
                                    │    :9093     │
                                    └──────────────┘
Prometheus uses a pull model: it periodically sends HTTP GET requests to configured targets (exporters) to fetch the latest metrics. Each exporter exposes a /metrics endpoint that returns data in Prometheus exposition format. This architecture means targets do not need to know about Prometheus — they simply serve their metrics when asked.
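To make the exposition format concrete, here is a minimal sketch of a shell helper that pulls one sample value out of /metrics-style text. The function names metric_value and sample_metrics and the sample values are assumptions made for this illustration, and the helper only handles metrics without labels, not the full format.

```shell
# Hypothetical helper: print the value of an unlabelled metric read from stdin.
# Sample lines have the shape "<name> <value>"; lines starting with '#'
# carry HELP/TYPE metadata and are skipped.
metric_value() {
  awk -v name="$1" '!/^#/ && $1 == name { print $2; exit }'
}

# Text in the shape a /metrics endpoint returns (values are made up).
sample_metrics() {
  cat <<'EOF'
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.3e+09
EOF
}

sample_metrics | metric_value node_load1   # prints 0.42
```

Once an exporter is running, the same helper can read live data, for example: curl -s http://localhost:9100/metrics | metric_value node_load1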
Installing Prometheus
Start by creating a dedicated system user and the necessary directories:
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
Download the latest Prometheus release (check prometheus.io/download for the current version):
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
tar xvf prometheus-2.53.0.linux-amd64.tar.gz
cd prometheus-2.53.0.linux-amd64
Copy the binaries and configuration files:
sudo cp prometheus promtool /usr/local/bin/
sudo cp -r consoles console_libraries /etc/prometheus/
sudo cp prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus
Verify the installation:
prometheus --version
You should see output displaying the version, build date, and Go version.
Configuring Prometheus
The main configuration file is /etc/prometheus/prometheus.yml. This YAML file defines global settings, scrape intervals, and target endpoints.
Create a clean configuration:
sudo nano /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
rule_files:
  - "alert_rules.yml"
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
Key configuration parameters:
- scrape_interval: How often Prometheus scrapes targets (the built-in default is 1m; 15s is a common choice when you want finer resolution)
- evaluation_interval: How often Prometheus evaluates alerting rules
- scrape_timeout: Maximum time to wait for a scrape response before marking the target as down
- job_name: A label applied to all metrics collected from the targets in this group
- static_configs: A list of target endpoints to scrape
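Prometheus rejects a configuration whose scrape_timeout is longer than its scrape_interval, so it can help to sanity-check the two values before editing the file. The following is a minimal sketch; to_seconds and timeout_fits are hypothetical helper names, and only simple durations with a single unit (s, m, h, or d) are handled.

```shell
# Hypothetical helper: convert a simple duration such as 15s or 1m to seconds.
# Compound forms like 1h30m and millisecond values are not supported.
to_seconds() {
  num=${1%[smhd]}
  unit=${1#"$num"}
  case $num in
    ''|*[!0-9]*) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
  case $unit in
    s) echo "$num" ;;
    m) echo $((num * 60)) ;;
    h) echo $((num * 3600)) ;;
    d) echo $((num * 86400)) ;;
    *) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# Succeeds when the timeout (first argument) fits inside the interval (second).
timeout_fits() {
  [ "$(to_seconds "$1")" -le "$(to_seconds "$2")" ]
}

timeout_fits 10s 15s && echo "scrape_timeout 10s fits within scrape_interval 15s"
```

The values 10s and 15s match the configuration above; promtool check config remains the authoritative check.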
Create a systemd service file for Prometheus:
sudo nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--storage.tsdb.retention.time=30d \
--web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
Enable and start Prometheus:
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus
Prometheus should now be running and accessible at http://your-server-ip:9090.
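If you script the installation, you may want to wait until the service actually answers before moving on. The sketch below is one way to do that; wait_until is a hypothetical helper name, and it simply retries whatever command you pass to it.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Discard the probe's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, wait_until 30 1 curl -fsS http://localhost:9090/-/ready polls the Prometheus readiness endpoint once per second for up to 30 attempts.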
Installing and Configuring node_exporter
node_exporter is the standard Prometheus exporter for hardware and OS-level metrics. It exposes CPU, memory, disk, filesystem, and network statistics on port 9100.
Create a dedicated user and download node_exporter:
sudo useradd --no-create-home --shell /bin/false node_exporter
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xvf node_exporter-1.8.1.linux-amd64.tar.gz
sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
Create the systemd service:
sudo nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
Enable and start node_exporter:
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Verify that metrics are being exposed:
curl -s http://localhost:9100/metrics | head -20
You should see lines beginning with # HELP and # TYPE followed by metric names and values. Confirm that Prometheus is scraping node_exporter by navigating to http://your-server-ip:9090/targets — the node job should show a status of UP.
Installing Grafana
Add the official Grafana APT repository:
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
Install and start Grafana:
sudo apt-get update
sudo apt-get install grafana -y
sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
Verify Grafana is running:
sudo systemctl status grafana-server
Grafana is now accessible at http://your-server-ip:3000. The default credentials are admin / admin. You will be prompted to change the password on first login.
Security: Change the default admin password immediately. In production, configure Grafana behind a reverse proxy with HTTPS. See our Nginx reverse proxy guide for instructions.
Creating Your First Dashboard
After logging into Grafana, add Prometheus as a data source:
- Navigate to Connections > Data Sources > Add data source
- Select Prometheus
- Set the URL to http://localhost:9090
- Click Save & Test; you should see “Successfully queried the Prometheus API”
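If you would rather not click through the UI, Grafana can also load data sources from provisioning files at startup. The sketch below prints such a file; the function name grafana_datasource_yaml and the data source name "Prometheus" are assumptions of this example.

```shell
# Print a Grafana data source provisioning file to stdout.
# The optional argument is the Prometheus URL (defaults to the local instance).
grafana_datasource_yaml() {
  cat <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${1:-http://localhost:9090}
    isDefault: true
EOF
}

grafana_datasource_yaml http://localhost:9090
```

On a package install, piping the output through sudo tee /etc/grafana/provisioning/datasources/prometheus.yaml and restarting grafana-server should make the data source appear without manual steps.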
Import the widely-used Node Exporter Full dashboard:
- Navigate to Dashboards > New > Import
- Enter dashboard ID 1860 (Node Exporter Full by rfmoz)
- Select your Prometheus data source
- Click Import
This dashboard provides comprehensive panels for CPU usage, memory utilization, disk I/O, network traffic, filesystem usage, and system load — all without writing a single query.
To create a custom panel:
- Click Add > Visualization on any dashboard
- In the query editor, enter a PromQL expression such as:
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
- This calculates the overall CPU usage percentage
- Configure the panel title, legend, thresholds, and unit format
- Click Apply to save the panel
PromQL Basics
PromQL (Prometheus Query Language) is a functional query language that lets you select, aggregate, and transform time-series data. Understanding PromQL is essential for building useful dashboards and alert rules.
Instant vectors return the most recent value for each time series:
node_memory_MemAvailable_bytes
Range vectors return values over a time window:
node_cpu_seconds_total[5m]
rate() calculates the per-second average rate of increase (essential for counters):
rate(node_cpu_seconds_total{mode="idle"}[5m])
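The arithmetic behind rate() can be sketched with two counter samples. This is a simplification: the real function also handles counter resets and extrapolates to the edges of the window. The helper name simple_rate and the sample numbers are made up for illustration.

```shell
# Simplified per-second rate between two counter samples:
#   (v2 - v1) / (t2 - t1)
# Arguments: v1 t1 v2 t2 (counter values, timestamps in seconds).
simple_rate() {
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" \
    'BEGIN { printf "%g\n", (v2 - v1) / (t2 - t1) }'
}

# An idle counter grows from 1000 s to 1255 s over a 300 s (5m) window.
simple_rate 1000 0 1255 300   # prints 0.85
```

A result of 0.85 means the core spent 85% of the window idle, which is why the CPU usage query subtracts the idle rate (times 100) from 100.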
Aggregation operators combine multiple time series:
# Average idle CPU fraction across all cores (subtract from 1 for usage)
avg without(cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Total memory per instance
sum by(instance) (node_memory_MemTotal_bytes)
# Top 5 instances by CPU usage
topk(5, 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100))
Mathematical operations allow combining metrics:
# Memory usage percentage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage percentage
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
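Both percentage queries share the same shape, (1 - available / total) * 100. As a quick worked example outside Prometheus, the sketch below applies that formula to two plain numbers; used_percent is a hypothetical helper name.

```shell
# Usage percentage from an "available" and a "total" value, mirroring
# (1 - available / total) * 100 from the queries above.
used_percent() {
  awk -v avail="$1" -v total="$2" \
    'BEGIN { printf "%.1f\n", (1 - avail / total) * 100 }'
}

# 2 GiB available out of 8 GiB total.
used_percent 2147483648 8589934592   # prints 75.0
```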
Label filtering uses curly braces:
# Filter by specific instance
node_cpu_seconds_total{instance="server1:9100", mode!="idle"}
# Regex match
node_filesystem_avail_bytes{mountpoint=~"/|/home"}
Setting Up Alerting Rules
Prometheus evaluates alerting rules at the evaluation_interval defined in the global configuration. When a rule condition is met, Prometheus fires an alert to Alertmanager, which handles deduplication, grouping, and routing to notification channels.
Create an alert rules file:
sudo nano /etc/prometheus/alert_rules.yml
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 85% for more than 5 minutes (current value: {{ $value }}%)"
      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for more than 5 minutes (current value: {{ $value }}%)"
      - alert: DiskSpaceLow
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Root filesystem usage is above 85% (current value: {{ $value }}%)"
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes."
Validate the rules file:
promtool check rules /etc/prometheus/alert_rules.yml
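The for: clause in each rule means an alert first becomes pending and only fires once its expression has stayed true for the whole duration; a single false evaluation resets it. The sketch below simulates that behaviour on a list of evaluation results. It is a simplified model with a hypothetical helper name (alert_state), not Prometheus code.

```shell
# Hypothetical simulation of the for: clause.
# Usage: alert_state <for_seconds> <interval_seconds> <result>...
# Each result is 1 (expression true) or 0 (false) at successive evaluations.
# Prints the state after the last evaluation: inactive, pending, or firing.
alert_state() {
  hold=$1
  interval=$2
  shift 2
  active_for=-1   # -1 means the condition is not currently active
  for result in "$@"; do
    if [ "$result" -eq 1 ]; then
      if [ "$active_for" -lt 0 ]; then
        active_for=0                              # first true evaluation
      else
        active_for=$((active_for + interval))     # still true one interval later
      fi
    else
      active_for=-1                               # a false evaluation resets the alert
    fi
  done
  if [ "$active_for" -lt 0 ]; then
    echo inactive
  elif [ "$active_for" -ge "$hold" ]; then
    echo firing
  else
    echo pending
  fi
}

# for: 2m with 15s evaluations: nine true results in a row span 120 s.
alert_state 120 15 1 1 1 1 1 1 1 1 1   # prints firing
```

With only three true results the same call prints pending, which is what the /alerts page shows while the for: timer is still running.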
Now install Alertmanager:
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xvf alertmanager-0.27.0.linux-amd64.tar.gz
sudo useradd --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo cp alertmanager-0.27.0.linux-amd64/alertmanager /usr/local/bin/
sudo cp alertmanager-0.27.0.linux-amd64/amtool /usr/local/bin/
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
Configure Alertmanager:
sudo nano /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email-notifications'
receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'admin@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password: 'your-app-password'
        require_tls: true
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts'
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
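The group_by setting decides which alerts are bundled into one notification. As a rough illustration, the sketch below groups a few made-up alerts by alertname and severity and counts how many fall into each group; group_alerts and the input format are assumptions of this example, not anything Alertmanager exposes.

```shell
# Hypothetical illustration of group_by: ['alertname', 'severity'].
# Reads one alert per line as "alertname severity instance" and prints each
# (alertname, severity) group with the number of alerts it contains.
group_alerts() {
  awk '{ count[$1 " " $2]++ } END { for (k in count) print k, count[k] }' | sort
}

group_alerts <<'EOF'
HighCPUUsage warning web1:9100
HighCPUUsage warning web2:9100
InstanceDown critical db1:9100
EOF
```

Here the two HighCPUUsage warnings share a group, so they would arrive as a single notification listing both instances, while InstanceDown forms a group of its own.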
Create the Alertmanager systemd service:
sudo nano /etc/systemd/system/alertmanager.service
[Unit]
Description=Prometheus Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager/
Restart=on-failure
RestartSec=5s
[Install]
WantedBy=multi-user.target
Enable and start Alertmanager:
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
Restart Prometheus to pick up the new alert rules:
sudo systemctl restart prometheus
Verify your alerts at http://your-server-ip:9090/alerts and check Alertmanager at http://your-server-ip:9093.
Monitoring Docker Containers with cAdvisor
cAdvisor (Container Advisor) provides container-level resource usage and performance metrics. It runs as a Docker container itself and exposes metrics in Prometheus format.
Start cAdvisor:
docker run -d \
--name=cadvisor \
--restart=always \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:ro \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--volume=/dev/disk/:/dev/disk:ro \
--publish=8080:8080 \
--privileged \
--device=/dev/kmsg \
gcr.io/cadvisor/cadvisor:v0.49.1
Add cAdvisor as a Prometheus scrape target by appending this job under scrape_configs in /etc/prometheus/prometheus.yml:
  - job_name: "cadvisor"
    static_configs:
      - targets: ["localhost:8080"]
Reload Prometheus:
curl -X POST http://localhost:9090/-/reload
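Because a reload with a broken configuration is rejected, it can be convenient to validate first and reload only on success. The wrapper below is a sketch of that idea; reload_prometheus and the PROMTOOL, CURL, and PROM_URL variables are names chosen for this example, not settings Prometheus itself reads.

```shell
# Hypothetical wrapper: validate the configuration, then ask Prometheus to reload.
# PROMTOOL, CURL, and PROM_URL can be overridden, which also makes the function
# easy to exercise without a running server.
reload_prometheus() {
  config=${1:-/etc/prometheus/prometheus.yml}
  if ! "${PROMTOOL:-promtool}" check config "$config"; then
    echo "configuration invalid, not reloading" >&2
    return 1
  fi
  # The reload endpoint requires the --web.enable-lifecycle flag set earlier.
  "${CURL:-curl}" -fsS -X POST "${PROM_URL:-http://localhost:9090}/-/reload"
}
```

Calling reload_prometheus with no arguments checks /etc/prometheus/prometheus.yml and then posts to the local reload endpoint.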
Useful cAdvisor PromQL queries:
# Container CPU usage
rate(container_cpu_usage_seconds_total{name!=""}[5m])
# Container memory usage
container_memory_usage_bytes{name!=""}
# Container network received bytes
rate(container_network_receive_bytes_total{name!=""}[5m])
# Container filesystem usage
container_fs_usage_bytes{name!=""}
Import Grafana dashboard ID 14282 (cAdvisor Exporter) for pre-built container monitoring panels.
Useful PromQL Queries Reference
| Query | Description |
|---|---|
| up | Check if targets are reachable (1 = up, 0 = down) |
| rate(node_cpu_seconds_total{mode="idle"}[5m]) | CPU idle fraction per core |
| node_memory_MemAvailable_bytes / 1024 / 1024 | Available memory in MiB |
| (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 | Memory usage percentage |
| rate(node_disk_read_bytes_total[5m]) | Disk read throughput (bytes per second) |
| rate(node_disk_written_bytes_total[5m]) | Disk write throughput (bytes per second) |
| rate(node_network_receive_bytes_total[5m]) | Network incoming traffic (bytes per second) |
| rate(node_network_transmit_bytes_total[5m]) | Network outgoing traffic (bytes per second) |
| node_filesystem_avail_bytes{mountpoint="/"} | Available disk space on root (bytes) |
| node_load1, node_load5, node_load15 | System load averages (1, 5, and 15 minutes) |
| rate(node_cpu_seconds_total{mode="iowait"}[5m]) | I/O wait fraction per core (multiply by 100 for a percentage) |
| node_time_seconds - node_boot_time_seconds | System uptime in seconds |
| count(node_cpu_seconds_total{mode="idle"}) | Number of CPU cores across all scraped instances |
| rate(container_cpu_usage_seconds_total{name!=""}[5m]) | Container CPU usage (cores) |
Troubleshooting
Prometheus fails to start:
Check the configuration syntax:
promtool check config /etc/prometheus/prometheus.yml
Check the service logs:
sudo journalctl -u prometheus -f --no-pager
Common issues include YAML indentation errors, invalid scrape intervals, and file permission problems on /var/lib/prometheus.
Target shows as DOWN in Prometheus:
Verify the exporter is running:
sudo systemctl status node_exporter
curl -v http://localhost:9100/metrics
Check the firewall:
sudo ufw status
sudo ufw allow 9100/tcp
Grafana cannot connect to Prometheus:
Ensure Prometheus is listening on the correct address. If Grafana and Prometheus are on the same server, use http://localhost:9090. Check connectivity:
curl 'http://localhost:9090/api/v1/query?query=up'
No data in Grafana panels:
Verify the time range selector in the dashboard. The selected range may fall before Prometheus started collecting data or outside your retention window. Also confirm the Prometheus data source is selected in the panel query editor.
High memory usage by Prometheus:
Reduce the number of time series by limiting exporters or adding metric relabeling rules. Check the current series count and the highest-cardinality metrics:
curl http://localhost:9090/api/v1/status/tsdb
Consider lowering retention time:
--storage.tsdb.retention.time=15d
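To judge how retention affects disk usage, a rough estimate is retention time in seconds multiplied by ingested samples per second and by bytes per sample. The sketch below does that multiplication; estimate_tsdb_bytes is a hypothetical helper, and the default of 2 bytes per sample is an assumed rough figure that you should replace with your own measurement.

```shell
# Rough TSDB disk estimate:
#   retention_seconds * samples_per_second * bytes_per_sample
# Arguments: retention in days, samples per second, optional bytes per sample
# (defaults to 2, an assumption for this sketch).
estimate_tsdb_bytes() {
  retention_days=$1
  samples_per_second=$2
  bytes_per_sample=${3:-2}
  echo $((retention_days * 86400 * samples_per_second * bytes_per_sample))
}

# 15 days of retention at 1000 samples per second.
estimate_tsdb_bytes 15 1000   # prints 2592000000 (about 2.6 GB)
```

The query rate(prometheus_tsdb_head_samples_appended_total[5m]) gives the samples-per-second figure for your own server.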
Alertmanager not sending notifications:
Test the Alertmanager configuration:
amtool check-config /etc/alertmanager/alertmanager.yml
Send a test alert:
amtool alert add alertname=TestAlert severity=critical --alertmanager.url=http://localhost:9093
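If amtool is not at hand, a test alert can also be posted to Alertmanager's HTTP API. The sketch below only builds the JSON body; alert_payload is a hypothetical helper name, and it does no JSON escaping, so label values must not contain quotes or backslashes.

```shell
# Build a minimal JSON body for Alertmanager's v2 alerts endpoint.
# Arguments: alert name, severity (no JSON escaping is performed).
alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"}}]\n' "$1" "$2"
}

alert_payload TestAlert critical
```

The output can be sent with curl -fsS -X POST -H 'Content-Type: application/json' -d "$(alert_payload TestAlert critical)" http://localhost:9093/api/v2/alerts, after which the alert should appear in the Alertmanager UI.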
Summary
You now have a complete monitoring stack running on your server: Prometheus collecting metrics from node_exporter and cAdvisor, Grafana rendering real-time dashboards, and Alertmanager delivering notifications when things go wrong. This setup gives you deep visibility into system health, container performance, and resource utilization.
Key takeaways from this guide:
- Prometheus scrapes metrics from targets at regular intervals and stores them in a time-series database
- node_exporter provides system-level metrics (CPU, memory, disk, network)
- Grafana visualizes metrics with customizable dashboards and panels
- PromQL is the query language for selecting, filtering, and aggregating metrics
- Alertmanager routes alerts to email, Slack, PagerDuty, or other notification channels
- cAdvisor exposes container-level resource metrics for Docker environments
For the foundation this monitoring stack runs on, make sure your server is properly secured following our Linux server security checklist. If you have not yet set up Docker for the cAdvisor section, follow our Docker installation guide on Ubuntu.
As your infrastructure grows, explore Prometheus federation for multi-server setups, Grafana Loki for log aggregation, and Thanos or Cortex for long-term storage and high availability.