TL;DR — Quick Summary

OpenTelemetry unifies traces, metrics, and logs under one vendor-neutral standard. Master OTel Collector, auto-instrumentation, and Grafana backends.

OpenTelemetry (OTel) is the open-source CNCF standard that unifies the collection of traces, metrics, and logs under one vendor-neutral API, SDK, and wire protocol. Before OTel, every observability vendor — Datadog, New Relic, Dynatrace, Elastic — required its own proprietary agent, forcing teams to instrument code multiple times and creating hard lock-in. OTel breaks that dependency: instrument once, ship to any backend. This guide covers the complete OTel architecture, Collector configuration, auto and manual instrumentation, distributed tracing, Kubernetes deployment patterns, and a practical Node.js + Python microservices example with Grafana.

Prerequisites

  • Familiarity with Docker and basic Kubernetes (for the Kubernetes sections)
  • A target application in Node.js, Python, Java, .NET, or Go
  • Docker Compose or a Kubernetes cluster for the practical example
  • Basic understanding of what traces, metrics, and logs are conceptually
  • kubectl configured if using the Kubernetes operator sections

OpenTelemetry Architecture

OTel is organized into three layers that are cleanly separated:

API — Language-specific interfaces (e.g., tracer.startSpan()). This is what your application code calls. The API alone does nothing; it needs an SDK implementation injected at runtime. This lets library authors instrument code without forcing a specific OTel version on their users.

SDK — The implementation of the API. The SDK handles sampling decisions, resource detection (hostname, k8s pod name, cloud region), and exporting telemetry to a backend via OTLP. Configure the SDK with environment variables or programmatic setup.

Collector — A standalone binary (or Docker container) that acts as a telemetry pipeline: receive from apps → process → export to backends. Applications send OTLP to the Collector; the Collector fans out to multiple backends. The Collector decouples instrumentation from backend configuration.

OTLP (the OpenTelemetry Protocol) is the canonical wire format: protobuf over gRPC (port 4317) or over HTTP with protobuf or JSON payloads (port 4318). Nearly every modern observability backend speaks OTLP natively.
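
On the HTTP transport, each signal is sent to its own path (/v1/traces, /v1/metrics, /v1/logs per the OTLP spec). A small helper to build the full per-signal endpoint from a base URL — a sketch, with the Collector hostname assumed from the examples in this guide:

```javascript
// Build the per-signal OTLP/HTTP endpoint from a Collector base URL.
// OTLP/HTTP routes each signal to its own path.
const OTLP_HTTP_PATHS = {
  traces: '/v1/traces',
  metrics: '/v1/metrics',
  logs: '/v1/logs',
};

function otlpHttpEndpoint(baseUrl, signal) {
  const path = OTLP_HTTP_PATHS[signal];
  if (!path) throw new Error(`unknown signal: ${signal}`);
  return new URL(path, baseUrl).toString();
}

otlpHttpEndpoint('http://collector:4318', 'traces');
// 'http://collector:4318/v1/traces'
```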

Signals are the three data types OTel handles: Traces (request flows across services), Metrics (aggregated numerical measurements), and Logs (discrete timestamped events).

The OpenTelemetry Collector

The Collector is the most powerful component in a production OTel deployment. It runs as a pipeline with three stages:

Receivers

Receivers accept telemetry from sources:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
  prometheus:
    config:
      scrape_configs:
        - job_name: "my-app"
          static_configs:
            - targets: ["app:8080"]
  filelog:
    include: ["/var/log/app/*.log"]
    operators:
      - type: json_parser
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      network: {}

Processors

Processors transform and filter telemetry between receive and export:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
  batch:
    send_batch_size: 10000
    timeout: 10s
  attributes:
    actions:
      - key: environment
        value: production
        action: insert
      - key: http.user_agent
        action: delete
  filter:
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-traces-policy
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

tail_sampling runs in the Collector (not the SDK) and makes decisions after seeing the whole trace — critical for capturing all errors while sampling only a fraction of healthy fast traces.

Exporters and Pipelines

exporters:
  otlp/tempo:
    endpoint: "tempo:4317"
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:9464"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  otlphttp/elasticsearch:
    endpoint: "http://elasticsearch:9200"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, tail_sampling]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, batch, attributes]
      exporters: [prometheus]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, batch]
      exporters: [loki]

Pipelines are independent per signal type. A single Collector can simultaneously feed traces to Tempo, metrics to Prometheus, and logs to Loki.

Auto-Instrumentation

Auto-instrumentation adds OTel without modifying application code. It intercepts HTTP, database, and messaging calls automatically.

Java — Run the OTel Java agent as a JVM flag:

java -javaagent:/otel-javaagent.jar \
  -Dotel.service.name=my-service \
  -Dotel.exporter.otlp.endpoint=http://collector:4317 \
  -jar my-app.jar

Python — Use opentelemetry-instrument:

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install
opentelemetry-instrument \
  --service_name my-service \
  --exporter_otlp_endpoint http://collector:4317 \
  python app.py

Node.js — Import the SDK before any application code:

// tracing.js — must be loaded first via --require
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  serviceName: 'my-node-service',
  traceExporter: new OTLPTraceExporter({ url: 'http://collector:4317' }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

Run with: node --require ./tracing.js app.js

.NET — Use the OTel .NET automatic instrumentation or add the NuGet package:

dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Exporter.OpenTelemetryProtocol

Go — Go binaries are compiled ahead of time, with no bytecode to manipulate at runtime, so manual SDK initialization is required; there is no drop-in zero-code agent comparable to Java's. Use otelhttp.NewHandler() to wrap HTTP handlers.

Manual Instrumentation

Auto-instrumentation captures framework-level operations. Manual instrumentation adds business-level context.

Creating Spans

const opentelemetry = require('@opentelemetry/api');

const tracer = opentelemetry.trace.getTracer('my-service', '1.0.0');

async function processOrder(orderId) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.setAttribute('order.region', 'us-east-1');

      const result = await chargePayment(orderId);
      span.addEvent('payment.charged', { 'payment.amount': result.amount });

      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: opentelemetry.SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

startActiveSpan automatically propagates context to child spans and HTTP calls made within the callback. recordException captures the stack trace as a span event with exception.type, exception.message, and exception.stacktrace attributes — searchable in Jaeger and Tempo.

Creating Metrics

OTel defines four instrument types for different measurement patterns:

const meter = opentelemetry.metrics.getMeter('my-service', '1.0.0');

// Counter — monotonically increasing (requests, errors)
const requestCounter = meter.createCounter('http.server.requests', {
  description: 'Total HTTP requests',
});

// Histogram — latency distributions, sizes
const latencyHistogram = meter.createHistogram('http.server.duration', {
  description: 'HTTP request latency',
  unit: 'ms',
});

// UpDownCounter — values that go up and down (queue depth, active connections)
const activeConnections = meter.createUpDownCounter('db.connections.active');

// Gauge (Observable) — spot readings (CPU, memory)
meter.createObservableGauge('process.memory.usage', {
  description: 'Process heap memory in bytes',
}).addCallback((result) => {
  result.observe(process.memoryUsage().heapUsed);
});

Aggregation temporality controls how the SDK reports metric values over time. Cumulative (the default) sends running totals since process start — matching Prometheus's model. Delta sends only the change since the last report — required by some backends, such as Azure Monitor.

Context Propagation

Context propagation connects spans across service boundaries via HTTP headers. OTel uses W3C TraceContext (traceparent, tracestate) by default:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^ ^^
             ver       traceId (128-bit)         spanId (64-bit)  flags
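
Unpacking the header is plain string handling; the sketch below mirrors what the W3C TraceContext propagator does internally (validation is simplified here — the real propagator also rejects all-zero IDs and unknown versions):

```javascript
// Parse a W3C traceparent header into its four dash-separated hex fields.
function parseTraceparent(header) {
  const parts = header.trim().split('-');
  if (parts.length !== 4) return null;
  const [version, traceId, spanId, flags] = parts;
  if (traceId.length !== 32 || spanId.length !== 16) return null;
  return {
    version,
    traceId,
    spanId,
    sampled: (parseInt(flags, 16) & 0x01) === 1, // bit 0 = sampled flag
  };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
// ctx.traceId → '4bf92f3577b34da6a3ce929d0e0e4736', ctx.sampled → true
```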

W3C Baggage propagates user-defined key-value pairs across the entire trace:

const { propagation, context } = require('@opentelemetry/api');
const baggage = propagation.getBaggage(context.active()) || propagation.createBaggage();
const newBaggage = baggage.setEntry('tenant.id', { value: 'acme-corp' });
const newContext = propagation.setBaggage(context.active(), newBaggage);

Auto-instrumentation handles header injection and extraction automatically. For manual HTTP calls, use propagation.inject(context.active(), headers).

Distributed Tracing Concepts

A trace is a directed acyclic graph of spans representing one end-to-end request through your system. Every span has:

  • TraceId — 128-bit ID shared by all spans in the trace
  • SpanId — 64-bit ID unique to this span
  • ParentSpanId — links child to parent (root span has no parent)
  • SpanKind — SERVER (incoming request), CLIENT (outgoing call), PRODUCER/CONSUMER (messaging), INTERNAL (internal operation)
  • Attributes — key-value metadata following OTel semantic conventions (http.method, db.statement, rpc.service)
  • Events — timestamped annotations on the span timeline
  • Status — Unset, Ok, or Error, with optional description

Sampling controls what percentage of traces are recorded. Head-based sampling decides at the root span — simple but cannot prefer slow/error traces because it decides before they complete. Tail-based sampling (via the Collector’s tail_sampling processor) collects the full trace first and then keeps it based on the outcome — capturing 100% of errors while sampling only 10% of healthy traces.
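
The difference is easy to quantify with back-of-the-envelope arithmetic; the traffic numbers below are made up for illustration:

```javascript
// Compare expected error-trace capture under head vs tail sampling.
// Head sampling keeps a flat fraction of ALL traces, errors included;
// tail sampling (errors policy + probabilistic policy) keeps every
// error trace plus a fraction of the healthy ones.
function expectedKept(totalTraces, errorTraces, sampleRate) {
  const healthy = totalTraces - errorTraces;
  return {
    head: { errors: errorTraces * sampleRate, healthy: healthy * sampleRate },
    tail: { errors: errorTraces, healthy: healthy * sampleRate },
  };
}

const { head, tail } = expectedKept(100000, 1000, 0.1);
// head: ~100 of 1000 error traces kept — 90% of errors are lost
// tail: all 1000 error traces kept, same 10% of healthy traffic
```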

Metrics in Depth

OTel metrics integrate with Prometheus via the Collector’s Prometheus exporter or the Prometheus receiver that scrapes /metrics endpoints. Exemplars are the key feature linking metrics to traces: they attach a traceId/spanId to specific histogram observations, enabling one-click navigation from a slow P99 latency spike directly to a representative trace in Tempo.

In the SDK, an exemplar is attached when a measurement is recorded while a sampled span is active; depending on your SDK's maturity this may require enabling an exemplar filter, but it needs no special aggregation configuration.

In Grafana, set exemplarsEnabled: true on the Prometheus datasource and configure Tempo as the trace datasource to enable the metric → trace drill-down.

Logs and Correlation

OTel log correlation attaches trace_id and span_id to every log line emitted while a span is active, enabling navigation from a Loki log entry directly to the corresponding Tempo trace.

The Log Bridge API connects existing logging libraries (Winston, Pino, Python logging, Log4j) to the OTel SDK without replacing them:

const { logs } = require('@opentelemetry/api-logs');
const logger = logs.getLogger('my-service');

// Manual log record — auto-instrumentation handles existing loggers
logger.emit({
  severityText: 'ERROR',
  body: 'Payment processing failed',
  attributes: { 'order.id': orderId, 'payment.provider': 'stripe' },
});

For structured logging that already emits JSON, the Collector's filelog receiver plus a JSON parser is often easier than changing the application — the parser can map an existing trace_id field in the log line onto the record's trace context.
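
If you control the log format, the correlation fields are just two extra keys on each JSON line. A minimal sketch of the shape that Loki-to-Tempo linking expects — the helper name is illustrative, and the IDs here are placeholders that would normally come from the active span via `trace.getActiveSpan().spanContext()`:

```javascript
// Emit a structured JSON log line carrying trace context for correlation.
function logWithTraceContext(level, message, traceId, spanId, attrs = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    trace_id: traceId, // 32 hex chars — links this line to its Tempo trace
    span_id: spanId,   // 16 hex chars — pins it to the exact span
    ...attrs,
  });
}

const line = logWithTraceContext(
  'ERROR', 'Payment processing failed',
  '4bf92f3577b34da6a3ce929d0e0e4736', '00f067aa0ba902b7',
  { 'order.id': 'ord-123' }
);
```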

Deployment Patterns

Agent mode — One Collector per host node (DaemonSet in Kubernetes). Applications send OTLP to localhost:4317. The agent does batching and forwards to a remote gateway or backend. Low latency, simple network path.

Gateway mode — Centralized Collector cluster (Deployment in Kubernetes, scaled horizontally). All agents forward to the gateway. The gateway runs expensive processors (tail sampling requires all spans from a trace on the same instance — use loadbalancing exporter to route by TraceId).

Sidecar mode — One Collector container per application Pod. Full isolation, no shared state. Useful for multi-tenant scenarios or when applications have very different processing requirements.

Kubernetes with the OTel Operator

The OpenTelemetry Operator manages Collectors and auto-instrumentation injection via Kubernetes CRDs:

kubectl apply -f https://github.com/open-telemetry/opentelemetry-operator/releases/latest/download/opentelemetry-operator.yaml

Deploy a Collector as a DaemonSet:

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-agent
spec:
  mode: daemonset
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4317"
    processors:
      batch: {}
      memory_limiter:
        limit_mib: 200
    exporters:
      otlp:
        endpoint: "otel-gateway:4317"
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]

Enable auto-instrumentation injection per namespace or pod using annotations:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-instrumentation
spec:
  exporter:
    endpoint: http://otel-agent:4317
  propagators: [tracecontext, baggage]
  sampler:
    type: parentbased_traceidratio
    argument: "0.1"
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest

Annotate a deployment to inject the Node.js agent automatically:

annotations:
  instrumentation.opentelemetry.io/inject-nodejs: "true"

Backends and Comparison

| Backend | Signals | Key Strength | Storage |
| --- | --- | --- | --- |
| Jaeger | Traces only | Simple UI, mature | Cassandra/Elasticsearch/memory |
| Prometheus | Metrics only | PromQL, huge ecosystem | Local TSDB |
| Grafana Loki | Logs only | Low cost, label-based | Object storage |
| Grafana Tempo | Traces only | TraceQL, exemplar links, cheap | Object storage |
| SigNoz | Traces + Metrics + Logs | All-in-one OTel-native | ClickHouse |
| Grafana Stack | All signals | Unified correlation, dashboards | Multiple |

And how the major platforms compare:

| Platform | Data Collection | Vendor Lock-in | Cost Model | Custom Backends |
| --- | --- | --- | --- | --- |
| OpenTelemetry | OTLP / agents | None (open standard) | Free (you host backends) | Any |
| Datadog | DD Agent (proprietary) | High | Per-host + ingestion | No |
| New Relic | NR Agent / OTel | Medium | Per GB ingested | Limited |
| Dynatrace | OneAgent (proprietary) | Very high | Per host unit | No |
| Elastic APM | Elastic Agent / OTel | Medium | Per GB | Elasticsearch only |
| Grafana Cloud | OTel-native | Low | Usage-based, free tier | Any via self-host |

OTel + Grafana Stack (Tempo + Prometheus + Loki) gives the lowest vendor lock-in and the best per-signal cost model for teams willing to operate their own backends.

Real-World Example: Node.js + Python with Grafana Stack

A practical two-service setup: a Node.js API gateway calls a Python order service. All telemetry flows through a single Collector to Tempo, Prometheus, and Loki, visualized in Grafana.

docker-compose.yml (excerpt):

services:
  collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./collector-config.yaml:/etc/otel/config.yaml
    command: ["--config=/etc/otel/config.yaml"]
    ports: ["4317:4317", "4318:4318", "9464:9464"]

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    ports: ["3200:3200"]

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]

  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor
    ports: ["3000:3000"]

  node-api:
    build: ./node-api
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
      - OTEL_SERVICE_NAME=node-api
      - OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production

  python-orders:
    build: ./python-orders
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
      - OTEL_SERVICE_NAME=python-orders

The Node.js service uses auto-instrumentation via --require ./tracing.js. The Python service uses opentelemetry-instrument. Both emit OTLP to the Collector. The Collector’s tail-sampling processor keeps all error traces and 10% of healthy traces. In Grafana: open Explore → Tempo → search by service.name=node-api → click a trace → use the exemplar link to see the correlated Prometheus P95 latency → click the Loki link to see the exact log line from that request.

Gotchas and Edge Cases

Clock skew — Distributed traces rely on synchronized clocks. Even a few milliseconds of NTP drift can make spans appear out of order in the UI. Use chrony or your cloud provider's time-sync service, and do not ignore clock warnings in Collector logs.
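
To see why small drift matters, consider a client span and the server span it triggers. A quick sketch with made-up timings, where the server's clock runs a few milliseconds behind the client's:

```javascript
// Show how clock skew inverts parent/child ordering in a trace UI.
// All times in ms; serverSkewMs = how far the server clock lags the client's.
function childAppearsBeforeParent(clientSendMs, networkLatencyMs, serverSkewMs) {
  // When the request really arrives, measured on the CLIENT's clock:
  const trueReceive = clientSendMs + networkLatencyMs;
  // The timestamp the server actually records (its own, lagging clock):
  const recordedReceive = trueReceive - serverSkewMs;
  // The child span "starts" before its parent sent the request:
  return recordedReceive < clientSendMs;
}

childAppearsBeforeParent(1000, 2, 5);  // true  — 5ms of skew beats 2ms of latency
childAppearsBeforeParent(1000, 10, 5); // false — latency hides the skew
```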

Tail sampling requires TraceId-consistent routing — If you run multiple Collector replicas for the gateway tier and use tail_sampling, all spans from the same trace must arrive at the same Collector instance. Use the loadbalancing exporter with routing_key: traceId in the agent tier to hash-route to the correct gateway.

OTLP gRPC requires HTTP/2 — Some reverse proxies (older nginx versions, AWS ALB without grpc-web) do not support HTTP/2 grpc. Use OTLP HTTP (port 4318) if you route through a proxy that terminates connections.

Cardinality explosion — Never use high-cardinality values (user IDs, order IDs, request IDs) as metric label values. OTel attributes on spans are fine at any cardinality; metric dimensions are not. The filter processor in the Collector can drop dangerous attributes before they reach Prometheus.
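
A typical fix is to normalize raw URL paths to bounded route templates before using them as metric dimensions. A minimal sketch — the regexes here are illustrative; in practice your router (Express, FastAPI) exposes the matched route template directly:

```javascript
// Collapse high-cardinality path segments into bounded route templates
// so they are safe to use as metric attribute values.
function toRouteTemplate(path) {
  return path
    .replace(/\/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}(?=\/|$)/gi, '/:uuid')
    .replace(/\/\d+(?=\/|$)/g, '/:id'); // whole numeric segments only
}

// BAD:  requestCounter.add(1, { 'http.target': '/orders/48213' })
//       → one time series per order ID (unbounded cardinality)
// GOOD: requestCounter.add(1, { 'http.route': toRouteTemplate('/orders/48213') })
//       → a single '/orders/:id' series

toRouteTemplate('/orders/48213'); // '/orders/:id'
```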

Auto-instrumentation version conflicts — The OTel Node.js SDK pins specific versions of instrumentation packages. If your app uses an older version of express or pg, the auto-instrumentation package may not match. Pin @opentelemetry/auto-instrumentations-node and check the compatibility matrix.

SDK initialization order — In Node.js, the SDK start() must complete before any instrumented module is require()d. Use --require to load the tracing file; do not use a dynamic import() inside the app entry point.

Summary

  • OpenTelemetry provides a single vendor-neutral API, SDK, and wire protocol (OTLP) for traces, metrics, and logs
  • The OTel Collector decouples applications from backends; use agents (DaemonSet) feeding a gateway (Deployment) with tail sampling in production
  • Auto-instrumentation (Java agent, Python opentelemetry-instrument, Node.js SDK) adds traces with zero code changes; manual instrumentation adds business context with startActiveSpan and createCounter/createHistogram
  • W3C TraceContext headers propagate trace identity across service boundaries automatically when using auto-instrumentation
  • Tail-based sampling in the Collector captures 100% of error traces while sampling a fraction of healthy traffic
  • Exemplars link Prometheus metric observations to specific Tempo trace IDs for one-click drill-down
  • The Grafana Stack (Tempo + Prometheus + Loki + Grafana) is the most popular open-source OTel backend combination
  • The OTel Operator manages Collector DaemonSets and auto-instrumentation injection via annotations in Kubernetes