TL;DR — Quick Summary
OpenTelemetry unifies traces, metrics, and logs under one vendor-neutral standard. Master OTel Collector, auto-instrumentation, and Grafana backends.
OpenTelemetry (OTel) is the open-source CNCF standard that unifies the collection of traces, metrics, and logs under one vendor-neutral API, SDK, and wire protocol. Before OTel, every observability vendor — Datadog, New Relic, Dynatrace, Elastic — required its own proprietary agent, forcing teams to instrument code multiple times and creating hard lock-in. OTel breaks that dependency: instrument once, ship to any backend. This guide covers the complete OTel architecture, Collector configuration, auto and manual instrumentation, distributed tracing, Kubernetes deployment patterns, and a practical Node.js + Python microservices example with Grafana.
Prerequisites
- Familiarity with Docker and basic Kubernetes (for the Kubernetes sections)
- A target application in Node.js, Python, Java, .NET, or Go
- Docker Compose or a Kubernetes cluster for the practical example
- Basic understanding of what traces, metrics, and logs are conceptually
- kubectl configured, if using the Kubernetes operator sections
OpenTelemetry Architecture
OTel is organized into three layers that are cleanly separated:
API — Language-specific interfaces (e.g., tracer.startSpan()). This is what your application code calls. The API alone does nothing; it needs an SDK implementation injected at runtime. This lets library authors instrument code without forcing a specific OTel version on their users.
SDK — The implementation of the API. The SDK handles sampling decisions, resource detection (hostname, k8s pod name, cloud region), and exporting telemetry to a backend via OTLP. Configure the SDK with environment variables or programmatic setup.
Collector — A standalone binary (or Docker container) that acts as a telemetry pipeline: receive from apps → process → export to backends. Applications send OTLP to the Collector; the Collector fans out to multiple backends. The Collector decouples instrumentation from backend configuration.
OTLP (the OpenTelemetry Protocol) is the canonical wire format: protobuf over gRPC (port 4317) or over HTTP (port 4318, with protobuf or JSON payloads). Nearly every modern observability backend speaks OTLP natively.
Signals are the three data types OTel handles: Traces (request flows across services), Metrics (aggregated numerical measurements), and Logs (discrete timestamped events).
The OpenTelemetry Collector
The Collector is the most powerful component in a production OTel deployment. It runs as a pipeline with three stages:
Receivers
Receivers accept telemetry from sources:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
  prometheus:
    config:
      scrape_configs:
        - job_name: "my-app"
          static_configs:
            - targets: ["app:8080"]
  filelog:
    include: ["/var/log/app/*.log"]
    operators:
      - type: json_parser
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      network: {}
```
Processors
Processors transform and filter telemetry between receive and export:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
  batch:
    send_batch_size: 10000
    timeout: 10s
  attributes:
    actions:
      - key: environment
        value: production
        action: insert
      - key: http.user_agent
        action: delete
  filter:
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```
tail_sampling runs in the Collector (not the SDK) and makes decisions after seeing the whole trace — critical for capturing all errors while sampling only a fraction of healthy fast traces.
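The decision logic those three policies encode can be sketched in plain JavaScript. This is purely illustrative — the function and field names are invented here, not the Collector's internals:

```javascript
// Illustrative tail-sampling decision: runs only once all spans of a
// trace have arrived (after decision_wait), mirroring the three policies
// above: errors, slow traces, then a probabilistic fallback.
function shouldKeepTrace(spans, { latencyThresholdMs = 1000, samplingPercentage = 10 } = {}) {
  // Policy 1: keep every trace that contains an error span.
  if (spans.some((s) => s.status === 'ERROR')) return true;

  // Policy 2: keep traces whose end-to-end duration exceeds the threshold.
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  if (end - start >= latencyThresholdMs) return true;

  // Policy 3: keep a random fraction of the remaining healthy, fast traces.
  return Math.random() * 100 < samplingPercentage;
}
```

Note that policies 1 and 2 are only possible because the decision is deferred until the whole trace is visible — a head-based sampler cannot know a trace will error or be slow.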
Exporters and Pipelines
```yaml
exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:9464"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  otlphttp/elasticsearch:
    endpoint: "http://elasticsearch:9200"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, batch]
      exporters: [loki]
```

Processor order matters: memory_limiter should come first and batch last, with sampling and attribute processors in between.
Pipelines are independent per signal type. A single Collector can simultaneously feed traces to Tempo, metrics to Prometheus, and logs to Loki.
Auto-Instrumentation
Auto-instrumentation adds OTel without modifying application code. It intercepts HTTP, database, and messaging calls automatically.
Java — Run the OTel Java agent as a JVM flag:
```shell
java -javaagent:/otel-javaagent.jar \
  -Dotel.service.name=my-service \
  -Dotel.exporter.otlp.endpoint=http://collector:4317 \
  -jar my-app.jar
```
Python — Use opentelemetry-instrument:
```shell
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install

opentelemetry-instrument \
  --service_name my-service \
  --exporter_otlp_endpoint http://collector:4317 \
  python app.py
```
Node.js — Import the SDK before any application code:
```javascript
// tracing.js — must be loaded first via --require
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  serviceName: 'my-node-service',
  traceExporter: new OTLPTraceExporter({ url: 'http://collector:4317' }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Run with: node --require ./tracing.js app.js
.NET — Use the OTel .NET automatic instrumentation or add the NuGet package:
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol
Go — Go compiles to native binaries, so there is no bytecode to patch at runtime and no mature zero-code agent (an experimental eBPF-based auto-instrumentation project exists). Initialize the SDK manually and wrap HTTP handlers with otelhttp.NewHandler().
Manual Instrumentation
Auto-instrumentation captures framework-level operations. Manual instrumentation adds business-level context.
Creating Spans
```javascript
const opentelemetry = require('@opentelemetry/api');
const tracer = opentelemetry.trace.getTracer('my-service', '1.0.0');

async function processOrder(orderId) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.setAttribute('order.region', 'us-east-1');
      const result = await chargePayment(orderId);
      span.addEvent('payment.charged', { 'payment.amount': result.amount });
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: opentelemetry.SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```
startActiveSpan automatically propagates context to child spans and HTTP calls made within the callback. recordException captures the stack trace as a span event with exception.type, exception.message, and exception.stacktrace attributes — searchable in Jaeger and Tempo.
Creating Metrics
OTel defines four instrument types for different measurement patterns:
```javascript
const meter = opentelemetry.metrics.getMeter('my-service', '1.0.0');

// Counter — monotonically increasing (requests, errors)
const requestCounter = meter.createCounter('http.server.requests', {
  description: 'Total HTTP requests',
});

// Histogram — latency distributions, sizes
const latencyHistogram = meter.createHistogram('http.server.duration', {
  description: 'HTTP request latency',
  unit: 'ms',
});

// UpDownCounter — values that go up and down (queue depth, active connections)
const activeConnections = meter.createUpDownCounter('db.connections.active');

// Gauge (Observable) — spot readings (CPU, memory)
meter.createObservableGauge('process.memory.usage', {
  description: 'Process heap memory in bytes',
}).addCallback((result) => {
  result.observe(process.memoryUsage().heapUsed);
});
```
Aggregation temporality controls how the SDK reports metric values over time. Cumulative (the default) sends running totals since process start — compatible with Prometheus's model. Delta sends only the change since the last report — required by some backends such as Azure Monitor.
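A tiny worked example of the difference, assuming a counter observed at three consecutive report intervals (toDelta is a hypothetical helper for illustration, not SDK API):

```javascript
// Convert a cumulative counter stream into the equivalent delta stream:
// each delta is the difference from the previous cumulative report.
function toDelta(cumulativeSamples) {
  return cumulativeSamples.map((v, i) => (i === 0 ? v : v - cumulativeSamples[i - 1]));
}

// Cumulative reports [5, 12, 20] and delta reports [5, 7, 8] describe
// the same underlying counter; only the reporting convention differs.
```

Either stream can reconstruct the other, which is why the Collector can translate temporality between SDKs and backends.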
Context Propagation
Context propagation connects spans across service boundaries via HTTP headers. OTel uses W3C TraceContext (traceparent, tracestate) by default:
```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
             ver traceId (128-bit)               spanId (64-bit)  flags
```
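A minimal parser for this header layout makes the four fields concrete. This is a sketch (parseTraceparent is a hypothetical helper; real SDKs do stricter, spec-complete validation):

```javascript
// Parse a W3C traceparent header into its four dash-separated fields.
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  // An all-zero traceId or spanId is invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x1) === 1 };
}
```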
W3C Baggage propagates user-defined key-value pairs across the entire trace:
```javascript
const { propagation, context } = require('@opentelemetry/api');

const baggage = propagation.getBaggage(context.active()) || propagation.createBaggage();
const newBaggage = baggage.setEntry('tenant.id', { value: 'acme-corp' });
const newContext = propagation.setBaggage(context.active(), newBaggage);
```
Auto-instrumentation handles header injection and extraction automatically. For manual HTTP calls, use propagation.inject(context.active(), headers).
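For intuition about what crosses the wire, the baggage header format can be sketched with two hypothetical helpers (simplified: the real W3C spec also allows per-entry properties and metadata, which are omitted here):

```javascript
// Serialize entries into the W3C `baggage` header shape:
//   key1=value1,key2=value2   (values percent-encoded)
function encodeBaggage(entries) {
  return Object.entries(entries)
    .map(([k, v]) => `${k}=${encodeURIComponent(v)}`)
    .join(',');
}

// Inverse: parse a `baggage` header back into a plain object.
function decodeBaggage(header) {
  const out = {};
  for (const pair of header.split(',')) {
    const idx = pair.indexOf('=');
    if (idx < 0) continue; // skip malformed pairs
    out[pair.slice(0, idx).trim()] = decodeURIComponent(pair.slice(idx + 1).trim());
  }
  return out;
}
```

Because baggage travels on every downstream request, keep entries small and never put secrets in them.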
Distributed Tracing Concepts
A trace is a directed acyclic graph of spans representing one end-to-end request through your system. Every span has:
- TraceId — 128-bit ID shared by all spans in the trace
- SpanId — 64-bit ID unique to this span
- ParentSpanId — links child to parent (root span has no parent)
- SpanKind — SERVER (incoming request), CLIENT (outgoing call), PRODUCER/CONSUMER (messaging), INTERNAL (internal operation)
- Attributes — key-value metadata following OTel semantic conventions (http.method, db.statement, rpc.service)
- Events — timestamped annotations on the span timeline
- Status — Unset, Ok, or Error, with an optional description
Sampling controls what percentage of traces are recorded. Head-based sampling decides at the root span — simple but cannot prefer slow/error traces because it decides before they complete. Tail-based sampling (via the Collector’s tail_sampling processor) collects the full trace first and then keeps it based on the outcome — capturing 100% of errors while sampling only 10% of healthy traces.
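The head-based side of this can be sketched as a parent-respecting, traceId-ratio sampler. This is illustrative only — headSample is not the SDK's actual TraceIdRatioBasedSampler implementation, but it shows why every service independently reaches the same decision for the same trace:

```javascript
// Head-based sampling in the spirit of parentbased_traceidratio:
// honor the parent's decision when one exists, otherwise derive a
// deterministic decision from the traceId itself.
function headSample(traceId, ratio, parentSampled = undefined) {
  if (parentSampled !== undefined) return parentSampled; // parent-based part

  // Map the low 64 bits of the 128-bit hex traceId onto [0, 1).
  const low = BigInt('0x' + traceId.slice(16));
  return Number(low) / 2 ** 64 < ratio;
}
```

Because the decision is a pure function of the traceId, no coordination between services is needed — the price is that the decision is made before the trace's outcome is known.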
Metrics in Depth
OTel metrics integrate with Prometheus via the Collector’s Prometheus exporter or the Prometheus receiver that scrapes /metrics endpoints. Exemplars are the key feature linking metrics to traces: they attach a traceId/spanId to specific histogram observations, enabling one-click navigation from a slow P99 latency spike directly to a representative trace in Tempo.
Exemplar support in the SDKs is largely automatic: when a histogram records a value while a sampled span's context is active, the SDK can attach that span's traceId and spanId to the observation as an exemplar. Support is still experimental in some language SDKs, so check your SDK's release notes.
In Grafana, enable exemplars on the Prometheus datasource and configure an exemplar trace-ID destination pointing at the Tempo datasource to enable the metric → trace drill-down.
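A hedged Grafana provisioning fragment for that wiring (the datasource names, the exemplar label name trace_id, and the tempo UID are assumptions for this particular setup):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id        # exemplar label that carries the trace ID
          datasourceUid: tempo  # UID of the Tempo datasource to link to
```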
Logs and Correlation
OTel log correlation attaches trace_id and span_id to every log line emitted while a span is active, enabling navigation from a Loki log entry directly to the corresponding Tempo trace.
The Log Bridge API connects existing logging libraries (Winston, Pino, Python logging, Log4j) to the OTel SDK without replacing them:
```javascript
const { logs } = require('@opentelemetry/api-logs');
const logger = logs.getLogger('my-service');

// Manual log record — auto-instrumentation handles existing loggers
logger.emit({
  severityText: 'ERROR',
  body: 'Payment processing failed',
  attributes: { 'order.id': orderId, 'payment.provider': 'stripe' },
});
```
For structured logging that already emits JSON, the Collector's filelog receiver plus JSON parser is often easier than changing the application — the parser can lift trace_id and span_id fields out of each log record so trace correlation still works.
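A sketch of that Collector-side approach, assuming the application writes JSON lines that already contain trace_id and span_id fields (the log path and field names here are assumptions):

```yaml
receivers:
  filelog:
    include: ["/var/log/app/*.log"]
    operators:
      - type: json_parser
        # Lift trace context out of the parsed JSON body so the
        # resulting log records are correlated with their traces.
        trace:
          trace_id:
            parse_from: attributes.trace_id
          span_id:
            parse_from: attributes.span_id
```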
Deployment Patterns
Agent mode — One Collector per host node (DaemonSet in Kubernetes). Applications send OTLP to localhost:4317. The agent does batching and forwards to a remote gateway or backend. Low latency, simple network path.
Gateway mode — Centralized Collector cluster (Deployment in Kubernetes, scaled horizontally). All agents forward to the gateway. The gateway runs expensive processors (tail sampling requires all spans from a trace on the same instance — use loadbalancing exporter to route by TraceId).
Sidecar mode — One Collector container per application Pod. Full isolation, no shared state. Useful for multi-tenant scenarios or when applications have very different processing requirements.
Kubernetes with the OTel Operator
The OpenTelemetry Operator manages Collectors and auto-instrumentation injection via Kubernetes CRDs:
```shell
kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml
```
Deploy a Collector as a DaemonSet:
```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-agent
spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4317"
    processors:
      batch: {}
      memory_limiter:
        check_interval: 1s
        limit_mib: 200
    exporters:
      otlp:
        endpoint: "otel-gateway:4317"
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
```
Enable auto-instrumentation injection per namespace or pod using annotations:
```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-instrumentation
spec:
  exporter:
    endpoint: http://otel-agent:4317
  propagators: [tracecontext, baggage]
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
```
Annotate a deployment to inject the Node.js agent automatically:
```yaml
annotations:
  instrumentation.opentelemetry.io/inject-nodejs: "true"
```
Backends and Comparison
| Backend | Signals | Key Strength | Storage |
|---|---|---|---|
| Jaeger | Traces only | Simple UI, mature | Cassandra/Elasticsearch/memory |
| Prometheus | Metrics only | PromQL, huge ecosystem | Local TSDB |
| Grafana Loki | Logs only | Low cost, label-based | Object storage |
| Grafana Tempo | Traces only | TraceQL, exemplar links, cheap | Object storage |
| SigNoz | Traces + Metrics + Logs | All-in-one OTel-native | ClickHouse |
| Grafana Stack | All signals | Unified correlation, dashboards | Multiple |

| Platform | Data Collection | Vendor Lock-in | Cost Model | Custom Backends |
|---|---|---|---|---|
| OpenTelemetry | OTLP / agents | None (open standard) | Free (you host backends) | Any |
| Datadog | DD Agent (proprietary) | High | Per-host + ingestion | No |
| New Relic | NR Agent / OTel | Medium | Per GB ingested | Limited |
| Dynatrace | OneAgent (proprietary) | Very high | Per host unit | No |
| Elastic APM | Elastic Agent / OTel | Medium | Per GB | Elasticsearch only |
| Grafana Cloud | OTel-native | Low | Usage-based, free tier | Any via self-host |
OTel + Grafana Stack (Tempo + Prometheus + Loki) gives the lowest vendor lock-in and the best per-signal cost model for teams willing to operate their own backends.
Real-World Example: Node.js + Python with Grafana Stack
A practical two-service setup: a Node.js API gateway calls a Python order service. All telemetry flows through a single Collector to Tempo, Prometheus, and Loki, visualized in Grafana.
docker-compose.yml (excerpt):
```yaml
services:
  collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./collector-config.yaml:/etc/otel/config.yaml
    command: ["--config=/etc/otel/config.yaml"]
    ports: ["4317:4317", "4318:4318", "9464:9464"]
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    ports: ["3200:3200"]
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]
  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]
  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor
    ports: ["3000:3000"]
  node-api:
    build: ./node-api
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
      - OTEL_SERVICE_NAME=node-api
      - OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
  python-orders:
    build: ./python-orders
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
      - OTEL_SERVICE_NAME=python-orders
```
The Node.js service uses auto-instrumentation via --require ./tracing.js. The Python service uses opentelemetry-instrument. Both emit OTLP to the Collector. The Collector’s tail-sampling processor keeps all error traces and 10% of healthy traces. In Grafana: open Explore → Tempo → search by service.name=node-api → click a trace → use the exemplar link to see the correlated Prometheus P95 latency → click the Loki link to see the exact log line from that request.
Gotchas and Edge Cases
Clock skew — Distributed traces rely on synchronized clocks; skew of even a few milliseconds between hosts can make child spans appear to start before their parents in the UI. Use chrony or your cloud provider's time sync, and do not ignore clock warnings in Collector logs.
Tail sampling requires TraceId-consistent routing — If you run multiple Collector replicas for the gateway tier and use tail_sampling, all spans from the same trace must arrive at the same Collector instance. Use the loadbalancing exporter with routing_key: traceId in the agent tier to hash-route to the correct gateway.
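An agent-tier sketch of that routing (the gateway's headless Service hostname is an assumption; a headless Service is what lets the DNS resolver see every gateway replica individually):

```yaml
exporters:
  loadbalancing:
    routing_key: "traceID"   # hash-route so a trace's spans share one gateway
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-gateway-headless.observability.svc  # assumed Service name
        port: 4317
```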
OTLP gRPC requires HTTP/2 — Some reverse proxies and load balancers (older nginx configurations, load balancers without gRPC support) cannot forward gRPC traffic. Use OTLP over HTTP (port 4318) if you route through a proxy that terminates connections.
Cardinality explosion — Never use high-cardinality values (user IDs, order IDs, request IDs) as metric label values. OTel attributes on spans are fine at any cardinality; metric dimensions are not. The filter processor in the Collector can drop dangerous attributes before they reach Prometheus.
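One defensive pattern is to allowlist metric attributes at the call site so a stray ID never becomes a Prometheus label. A minimal sketch (the attribute names and helper are illustrative, not a standard API):

```javascript
// Only these low-cardinality keys may become metric dimensions.
const ALLOWED_METRIC_ATTRS = new Set(['http.method', 'http.route', 'http.status_code']);

// Strip any attribute not on the allowlist before passing the rest to
// counter.add(...) or histogram.record(...).
function safeMetricAttributes(attrs) {
  const out = {};
  for (const [key, value] of Object.entries(attrs)) {
    if (ALLOWED_METRIC_ATTRS.has(key)) out[key] = value;
  }
  return out;
}
```

High-cardinality values like order IDs still belong on spans, where they are cheap and searchable.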
Auto-instrumentation version conflicts — The OTel Node.js SDK pins specific versions of instrumentation packages. If your app uses an older version of express or pg, the auto-instrumentation package may not match. Pin @opentelemetry/auto-instrumentations-node and check the compatibility matrix.
SDK initialization order — In Node.js, the SDK start() must complete before any instrumented module is require()d. Use --require to load the tracing file; do not use a dynamic import() inside the app entry point.
Summary
- OpenTelemetry provides a single vendor-neutral API, SDK, and wire protocol (OTLP) for traces, metrics, and logs
- The OTel Collector decouples applications from backends; use agents (DaemonSet) feeding a gateway (Deployment) with tail sampling in production
- Auto-instrumentation (Java agent, Python opentelemetry-instrument, Node.js SDK) adds traces with zero code changes; manual instrumentation adds business context with startActiveSpan and createCounter/createHistogram
- W3C TraceContext headers propagate trace identity across service boundaries automatically when using auto-instrumentation
- Tail-based sampling in the Collector captures 100% of error traces while sampling a fraction of healthy traffic
- Exemplars link Prometheus metric observations to specific Tempo trace IDs for one-click drill-down
- The Grafana Stack (Tempo + Prometheus + Loki + Grafana) is the most popular open-source OTel backend combination
- The OTel Operator manages Collector DaemonSets and auto-instrumentation injection via annotations in Kubernetes