You cannot fix what you cannot see. When a server’s CPU spikes at 3 AM, a disk fills up silently, or a network interface starts dropping packets, you need dashboards that tell you exactly what happened, when, and where. Grafana combined with Prometheus gives you a production-grade monitoring stack that scales from a handful of servers to thousands of nodes — all with open-source tools.
This guide walks you through deploying the Prometheus-Grafana stack, building dashboards that surface the metrics that matter, writing effective PromQL queries, and configuring alerts that notify you before problems become outages.
Prerequisites
Before you begin, ensure you have:
- Linux server (Ubuntu 22.04 LTS recommended) with at least 2 GB RAM
- Docker and Docker Compose installed
- Network access to the servers you want to monitor
- Basic familiarity with YAML and the Linux terminal
Step 1: Deploy the Monitoring Stack
The fastest way to get Prometheus and Grafana running is with Docker Compose. This setup includes Prometheus, Grafana, and Node Exporter (for host metrics).
Create the Project Structure
```shell
mkdir -p monitoring/{prometheus,grafana}
cd monitoring
```
Docker Compose Configuration
```yaml
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    container_name: prometheus
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.0.0
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    restart: unless-stopped
    depends_on:
      - prometheus

  node-exporter:
    image: prom/node-exporter:v1.8.0
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
```
Prometheus Configuration
```yaml
# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node-exporter:9100'
        labels:
          instance: 'monitoring-server'
      # Add remote servers running Node Exporter
      # - targets: ['192.168.1.10:9100']
      #   labels:
      #     instance: 'web-01'
      # - targets: ['192.168.1.11:9100']
      #   labels:
      #     instance: 'db-01'
```
Start the Stack
```shell
docker compose up -d
```
Verify everything is running:
```shell
docker compose ps
# Check Prometheus targets: http://your-server:9090/targets
# Access Grafana: http://your-server:3000 (admin / changeme)
```
Step 2: Add Prometheus as a Data Source
- Open Grafana at `http://your-server:3000`
- Log in with `admin` / `changeme`
- Navigate to Connections > Data Sources > Add data source
- Select Prometheus
- Set the URL to `http://prometheus:9090` (Docker internal DNS)
- Click Save & Test; you should see "Data source is working"
You can also provision data sources automatically by creating a YAML file:
```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
```
Mount this into Grafana by adding to your Docker Compose:
```yaml
# under the grafana service, alongside the existing grafana_data mount
volumes:
  - grafana_data:/var/lib/grafana
  - ./grafana/provisioning:/etc/grafana/provisioning
```
Step 3: Essential PromQL Queries
PromQL (Prometheus Query Language) is how you extract meaningful data from Prometheus. Here are the essential queries for infrastructure monitoring.
CPU Metrics
```promql
# CPU usage percentage (across all cores)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# CPU usage by mode (user, system, iowait)
avg by (instance, mode) (irate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100

# CPU load average (1 minute)
node_load1
```
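The idle-mode query can be puzzling at first: `node_cpu_seconds_total` is a counter of seconds spent in each CPU mode, so its per-second rate is the *fraction* of time spent in that mode, and subtracting the idle fraction from 100 gives usage. A small sketch with made-up counter samples shows the arithmetic:

```python
# Hypothetical counter samples for node_cpu_seconds_total{mode="idle"}
# on a single core, taken 60 seconds apart (values are made up).
t0, idle0 = 1000.0, 5400.0   # (timestamp, counter value in seconds)
t1, idle1 = 1060.0, 5442.0

# rate()/irate() compute the per-second increase of the counter. Since the
# counter itself counts seconds spent idle, the rate is the fraction of
# wall-clock time the CPU was idle.
idle_fraction = (idle1 - idle0) / (t1 - t0)   # 42 idle seconds / 60 elapsed

cpu_usage_pct = 100 - idle_fraction * 100
print(round(cpu_usage_pct, 1))  # 30.0
```

The `avg by (instance)` in the real query does the same thing per core and averages the fractions, which is why the result stays between 0 and 100 regardless of core count.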
Memory Metrics
```promql
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Available memory in GB
node_memory_MemAvailable_bytes / 1024 / 1024 / 1024

# Swap usage
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100
```
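Unlike the CPU query, these are plain gauge arithmetic with no rate involved. Here is the usage formula with hypothetical byte counts (the values are made up):

```python
# Hypothetical readings from the node_memory_* gauges, in bytes.
mem_total = 16 * 1024**3      # 16 GiB total RAM
mem_available = 4 * 1024**3   # 4 GiB available

# Same arithmetic as the PromQL expression above.
usage_pct = (1 - mem_available / mem_total) * 100
print(usage_pct)  # 75.0
```

`MemAvailable` is the right gauge to use rather than `MemFree`: it includes reclaimable page cache, so the result reflects memory actually available to applications instead of overstating pressure.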
Disk Metrics
```promql
# Disk usage percentage per mount point
(1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"})) * 100

# Disk I/O rate (reads + writes per second)
irate(node_disk_read_bytes_total[5m]) + irate(node_disk_written_bytes_total[5m])

# Disk I/O utilization
irate(node_disk_io_time_seconds_total[5m]) * 100
```
Network Metrics
```promql
# Network receive rate (bytes/sec)
irate(node_network_receive_bytes_total{device!~"lo|veth.*|docker.*|br-.*"}[5m])

# Network transmit rate (bytes/sec)
irate(node_network_transmit_bytes_total{device!~"lo|veth.*|docker.*|br-.*"}[5m])

# Network errors
irate(node_network_receive_errs_total[5m]) + irate(node_network_transmit_errs_total[5m])
```
System Uptime
```promql
# Uptime in days
(time() - node_boot_time_seconds) / 86400
```
Step 4: Build the Dashboard
Create a New Dashboard
- Click + > New Dashboard
- Click Add visualization
- Select your Prometheus data source
- Enter a PromQL query in the query editor
Panel Types and When to Use Them
| Panel Type | Best For | Example |
|---|---|---|
| Time Series | Metrics over time | CPU usage, network bandwidth |
| Stat | Single current value | Uptime, total memory |
| Gauge | Value within a range | Disk usage %, CPU temp |
| Bar Gauge | Comparing values | Disk usage per mount |
| Table | Tabular data | Top processes, server list |
| Heatmap | Distribution over time | Request latency buckets |
Example: CPU Usage Panel
Create a Time Series panel with this configuration:
Query:

```promql
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Panel settings:

- Title: `CPU Usage %`
- Unit: `Percent (0-100)`
- Min: 0, Max: 100
- Thresholds: Green (0-70), Yellow (70-85), Red (85-100)
- Legend: `{{instance}}`
Example: Memory Gauge Panel
Create a Gauge panel:
Query:

```promql
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
```

Panel settings:

- Title: `Memory Usage`
- Unit: `Percent (0-100)`
- Thresholds: Green (0-75), Orange (75-90), Red (90-100)
Step 5: Template Variables
Variables make dashboards dynamic. Instead of hardcoding server names, let users select from a dropdown.
Create an Instance Variable
1. Go to Dashboard Settings > Variables > New
2. Configure:
   - Name: `instance`
   - Type: Query
   - Data source: Prometheus
   - Query: `label_values(node_uname_info, instance)`
   - Multi-value: Enable
   - Include All option: Enable
3. Click Apply
Use Variables in Queries
Replace hardcoded instance labels with `$instance`:

```promql
# CPU usage filtered by selected instance
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle", instance=~"$instance"}[5m])) * 100)

# Memory for selected instance
(1 - (node_memory_MemAvailable_bytes{instance=~"$instance"} / node_memory_MemTotal_bytes{instance=~"$instance"})) * 100
```
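The `=~` regex matcher (rather than `=`) is what makes Multi-value work: when several instances are selected, Grafana interpolates the variable as a pipe-joined alternation. A rough sketch of that interpolation (Grafana additionally regex-escapes special characters in each value, omitted here):

```python
# Values a user ticked in the dashboard's multi-select "instance" dropdown
# (hypothetical instance names).
selected = ["web-01", "db-01"]

# Grafana interpolates a multi-value variable used with =~ as a
# pipe-joined alternation, so the selector matches any selected instance.
regex_value = "|".join(selected)

query = f'node_memory_MemAvailable_bytes{{instance=~"{regex_value}"}}'
print(query)
# node_memory_MemAvailable_bytes{instance=~"web-01|db-01"}
```

With "Include All" enabled, selecting All interpolates a pattern matching every value, so the same query works for one server or the whole fleet.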
Add a Job Variable
Create a second query variable named `job` with the query:

```promql
label_values(up, job)
```
This lets users filter by monitor type (node-exporter, blackbox, application exporters).
Step 6: Alerting in Grafana
Create an Alert Rule
- Navigate to Alerting > Alert Rules > New Alert Rule
- Define the rule:
High CPU Alert:
- Rule name: High CPU Usage
- Query: `100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`
- Condition: IS ABOVE 90
- Evaluate every: 1m
- For: 5m (alert fires after 5 minutes of sustained high CPU)
Low Disk Space Alert:
- Rule name: Low Disk Space
- Query: `(1 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay",mountpoint="/"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay",mountpoint="/"})) * 100`
- Condition: IS ABOVE 85
- For: 10m
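The For duration is what separates a transient spike from a sustained problem: the condition must hold across every evaluation in the window before the rule moves from pending to firing, and a single healthy evaluation resets the clock. A toy simulation of that behavior (threshold and samples are made up):

```python
# One CPU-usage sample per 1-minute evaluation (hypothetical values).
samples = [95, 96, 93, 97, 94, 95]
THRESHOLD = 90        # "IS ABOVE 90"
FOR_EVALUATIONS = 5   # "For: 5m" at "Evaluate every: 1m"

consecutive = 0
fired = False
for value in samples:
    # Count consecutive breaches; any value back under the threshold resets.
    consecutive = consecutive + 1 if value > THRESHOLD else 0
    if consecutive >= FOR_EVALUATIONS:
        fired = True
print(fired)  # True: threshold breached for 5 straight evaluations
```

Had any single sample dipped below 90, the counter would reset and the alert would stay in the pending state, which is exactly why For suppresses noisy one-off spikes.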
Configure Notification Channels
Set up a Slack notification contact point:
- Go to Alerting > Contact Points > New Contact Point
- Select Slack
- Configure the webhook URL from your Slack workspace
- Set the channel (e.g., `#alerts-infra`)
For email notifications:
```yaml
# Add to the grafana service's environment in docker-compose.yml
environment:
  - GF_SMTP_ENABLED=true
  - GF_SMTP_HOST=smtp.company.com:587
  - GF_SMTP_USER=grafana@company.com
  - GF_SMTP_PASSWORD=smtp_password
  - GF_SMTP_FROM_ADDRESS=grafana@company.com
```
Dashboard JSON Export/Import
Save your dashboards as JSON for version control:
```shell
# Export via API
curl -s -H "Authorization: Bearer YOUR_API_KEY" \
  http://localhost:3000/api/dashboards/uid/YOUR_DASHBOARD_UID | \
  jq '.dashboard' > dashboard-infra.json

# Import via API
curl -s -X POST -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d "{\"dashboard\": $(cat dashboard-infra.json), \"overwrite\": true}" \
  http://localhost:3000/api/dashboards/db
```
Store dashboard JSON files in Git alongside your infrastructure code for reproducible monitoring setups.
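When scripting imports, the POST body wraps the dashboard JSON in an envelope with an `overwrite` flag. A sketch of building that payload in Python; the `dashboard` dict here is a hypothetical stand-in for a real export loaded from `dashboard-infra.json`:

```python
import json

# Hypothetical exported dashboard (normally read from dashboard-infra.json).
dashboard = {"uid": "infra-overview", "title": "Infrastructure", "panels": []}

# Envelope for POST /api/dashboards/db: null out "id" so Grafana matches on
# the stable uid rather than a database-local id, and set overwrite so
# repeated imports update the existing dashboard.
payload = {
    "dashboard": {**dashboard, "id": None},
    "overwrite": True,
    "message": "Imported from Git",  # optional note shown in version history
}
body = json.dumps(payload)
```

Keeping `uid` stable across environments means panel links and alert references keep working wherever the dashboard is imported.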
Troubleshooting Common Issues
Grafana Shows “No Data”
```shell
# Verify Prometheus is scraping targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Check if the metric exists in Prometheus
curl -s "http://localhost:9090/api/v1/query?query=up" | jq

# Verify the data source URL in Grafana
# Use http://prometheus:9090 (Docker DNS), NOT localhost
```
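If you would rather script this check than eyeball jq output, the targets endpoint returns JSON you can filter directly. A sketch against an abbreviated, made-up response (real responses include more fields, such as `scrapeUrl` and `lastScrape`):

```python
import json

# Abbreviated sample of a /api/v1/targets response; values are made up.
response = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "prometheus", "instance": "localhost:9090"},
       "health": "up", "lastError": ""},
      {"labels": {"job": "node-exporter", "instance": "monitoring-server"},
       "health": "down", "lastError": "connection refused"}
    ]
  }
}
""")

# Surface unhealthy targets and their scrape errors.
for t in response["data"]["activeTargets"]:
    if t["health"] != "up":
        print(t["labels"]["job"], "-", t["lastError"])
```

The `lastError` field usually names the root cause (connection refused, timeout, DNS failure), which narrows the fix faster than the dashboard alone.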
Node Exporter Not Appearing in Targets
```shell
# Check if Node Exporter is running
curl -s http://localhost:9100/metrics | head -5

# Reload Prometheus configuration
curl -X POST http://localhost:9090/-/reload

# Check Prometheus logs
docker compose logs prometheus | tail -20
```
Dashboard Variables Not Populating
Ensure the variable query uses the correct metric name:
```promql
# Correct: use a metric that exists
label_values(node_uname_info, instance)

# Wrong: using a metric that hasn't been scraped yet
label_values(nonexistent_metric, instance)
```
Verify in Prometheus that `node_uname_info` returns results before using it in a variable query.
High Memory Usage by Prometheus
```yaml
# Limit retention via the Prometheus command args in docker-compose.yml
# (these are startup flags, not prometheus.yml settings)
command:
  - '--storage.tsdb.retention.time=15d'  # Reduce from 30d
  - '--storage.tsdb.retention.size=5GB'  # Add size limit
```
Dashboard Best Practices
- Use rows to organize — group related panels (CPU, Memory, Disk, Network) into collapsible rows
- Set meaningful thresholds — green/yellow/red indicators help users spot problems instantly
- Use consistent units — standardize on bytes vs. GB, percent vs. ratio across all panels
- Limit the time range — default to 6h or 12h; long ranges slow queries on large datasets
- Add annotations — mark deployments, incidents, and maintenance windows on your dashboards
- Use stat panels for overview — place key metrics (uptime, total servers, alerts firing) at the top
- Export to JSON — version control your dashboards in Git alongside infrastructure code
Summary
A well-built Grafana dashboard turns raw Prometheus metrics into actionable visibility across your infrastructure. You’ve learned how to deploy the Prometheus-Grafana stack with Docker Compose, write PromQL queries for CPU, memory, disk, and network metrics, build dashboards with appropriate panel types, implement template variables for multi-server filtering, and configure alerting rules with notification channels.
Start with the essential system metrics covered here, then expand to application-specific exporters (MySQL, PostgreSQL, Redis, nginx) as your monitoring needs grow. The combination of Prometheus’s reliable metric collection and Grafana’s flexible visualization gives you a monitoring platform that scales from a home lab to enterprise infrastructure.