TL;DR — Quick Summary

Complete Envoy Proxy guide for service mesh and edge proxy deployments. Covers the xDS APIs, load balancing algorithms, mTLS via SDS, observability, circuit breaking, advanced filters, and canary traffic splitting.

Envoy Proxy is the data plane at the heart of modern service meshes. Originally built at Lyft to solve microservice observability and reliability at scale, Envoy is now the de facto sidecar for Istio, the edge proxy for many ingress controllers, and a general-purpose L3/L4/L7 proxy used by Google, AWS, and thousands of companies. This guide covers Envoy’s architecture, static and dynamic configuration, load balancing algorithms, observability pipeline, TLS management, and advanced filters — everything needed to run Envoy as a front proxy or service mesh data plane.

Prerequisites

  • Docker (for standalone testing) or a Kubernetes cluster.
  • Basic understanding of HTTP, TLS, and reverse proxy concepts.
  • Familiarity with YAML configuration syntax.
  • curl and optionally jq for testing endpoints.

Envoy Architecture

Envoy operates as an out-of-process network proxy — it runs alongside your application rather than as a library inside it. This keeps the proxy language-agnostic and allows independent upgrades.

Core components:

  • Listeners — Network ports Envoy binds to (downstream connections arrive here).
  • Filter chains — Ordered list of network and HTTP filters applied to each connection.
  • Clusters — Named groups of upstream endpoints (your backend services).
  • Endpoints — Individual IP:port pairs within a cluster, discovered via EDS or static config.
  • Routes — Rules mapping incoming requests to clusters based on path, header, or query parameters.

Thread model: Envoy uses a single main thread for management plus one worker thread per CPU core. Each worker thread independently handles connections using non-blocking I/O via libevent. There is no lock contention on the hot path — each worker has its own connection pool.

Hot restart: Envoy supports zero-downtime binary upgrades via a shared-memory handshake between the old and new process. The new process takes over the listening sockets while the old process drains its existing connections before exiting — critical for production deployments.


Static Configuration (envoy.yaml)

The fastest way to start is a static YAML config with all resources defined inline:

admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901

static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 10000
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: backend
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/"
                          route:
                            cluster: service_backend
                            timeout: 15s
                            retry_policy:
                              retry_on: "5xx,reset,connect-failure"
                              num_retries: 3
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
    - name: service_backend
      connect_timeout: 0.5s
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: service_backend
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: backend-service
                      port_value: 8080
      health_checks:
        - timeout: 1s
          interval: 10s
          unhealthy_threshold: 2
          healthy_threshold: 2
          http_health_check:
            path: /health

The admin block exposes Envoy’s management API on port 9901. Use it to query /stats, /clusters, /config_dump, and /healthcheck/fail for circuit-breaker testing.


Dynamic Configuration: The xDS API

Static configs work for simple deployments but become unwieldy at scale. Envoy’s xDS (x Discovery Service) protocol lets a control plane push configuration changes at runtime — no reload, no restart.

xDS resource types:

| API | Manages |
| --- | --- |
| LDS (Listener Discovery Service) | Listeners and filter chains |
| RDS (Route Discovery Service) | Virtual hosts and route tables |
| CDS (Cluster Discovery Service) | Cluster definitions and policies |
| EDS (Endpoint Discovery Service) | Individual endpoint IP:port pairs and their health |
| SDS (Secret Discovery Service) | TLS certificates and private keys |

ADS (Aggregated Discovery Service): Combines all xDS APIs into a single bidirectional gRPC stream. This is the recommended mode because it guarantees ordering — a new cluster is always delivered before the route that references it, preventing temporary 503 errors during updates.

Delta xDS: Rather than sending the full state on every update, delta xDS sends only the added, modified, or removed resources. Essential for large meshes with thousands of clusters.

To enable dynamic config, replace static_resources with a dynamic_resources block pointing at your control plane:

dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc:
          cluster_name: xds_cluster
  lds_config:
    resource_api_version: V3
    ads: {}
  cds_config:
    resource_api_version: V3
    ads: {}
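
The envoy_grpc service above points at xds_cluster, which cannot itself be delivered over xDS — it must be defined statically in the bootstrap. A minimal sketch, with a placeholder control-plane address and port; note that gRPC transport requires the cluster to speak HTTP/2:

```yaml
static_resources:
  clusters:
    - name: xds_cluster
      connect_timeout: 1s
      type: STRICT_DNS
      # gRPC requires HTTP/2 on the upstream connection
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}
      load_assignment:
        cluster_name: xds_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: control-plane.example.internal
                      port_value: 18000
```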

Control planes that implement xDS: Istio istiod, Consul Connect, solo.io Gloo, and the reference go-control-plane library for custom implementations.


Load Balancing Algorithms

Envoy supports six load balancing policies selectable per cluster:

| Policy | Best For |
| --- | --- |
| ROUND_ROBIN | Uniform backend capacity; the default choice |
| LEAST_REQUEST | Variable request durations; avoids hot backends |
| RING_HASH | Consistent hashing for cache affinity and stateful services |
| RANDOM | Simple, low overhead, resilient to slow endpoints |
| MAGLEV | Google’s consistent hash; more even distribution than ring hash |
| CLUSTER_PROVIDED | Delegates the decision to the upstream cluster type |

Least Request uses a power-of-two random choices algorithm: picks two random endpoints and routes to the one with fewer active requests. This outperforms round-robin when request durations vary significantly.
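
The sample size is tunable via least_request_lb_config — a minimal sketch (2 is the default; larger values trade a little CPU for tighter balance):

```yaml
clusters:
  - name: service_backend
    lb_policy: LEAST_REQUEST
    least_request_lb_config:
      # Sample this many random hosts and route to the least loaded (default: 2)
      choice_count: 3
```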

Ring Hash maps requests to endpoints using a consistent hash ring. Useful for upstream caches where the same key should always reach the same backend. Configure minimum_ring_size (default 1024) and maximum_ring_size for distribution quality.
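
Ring hash needs two pieces of configuration: ring parameters on the cluster, and a hash policy on the route telling Envoy what to hash. A sketch, assuming a hypothetical x-user-id header as the affinity key:

```yaml
# On the cluster: size the hash ring
clusters:
  - name: cache_cluster
    lb_policy: RING_HASH
    ring_hash_lb_config:
      minimum_ring_size: 2048

# On the route: define the hash key for each request
route:
  cluster: cache_cluster
  hash_policy:
    - header:
        header_name: x-user-id
```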


Observability

Envoy is opinionated about observability — it was built to make distributed systems debuggable. Three built-in pillars:

Stats: Envoy emits thousands of counters, gauges, and histograms. Expose them to Prometheus at the admin endpoint:

curl http://localhost:9901/stats/prometheus

Key metrics: envoy_cluster_upstream_rq_total, envoy_cluster_upstream_rq_time, envoy_http_downstream_rq_5xx, envoy_cluster_circuit_breakers_default_open.
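
A minimal Prometheus scrape job for the endpoint above, assuming the admin listener is reachable at envoy:9901 from Prometheus:

```yaml
scrape_configs:
  - job_name: envoy
    # Envoy exposes Prometheus-format metrics on its admin port
    metrics_path: /stats/prometheus
    static_configs:
      - targets: ["envoy:9901"]
```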

Distributed Tracing: Envoy automatically generates and propagates trace context headers for Jaeger, Zipkin, and OpenTelemetry. Add a tracing block to the HttpConnectionManager:

tracing:
  provider:
    name: envoy.tracers.opentelemetry
    typed_config:
      "@type": type.googleapis.com/envoy.config.trace.v3.OpenTelemetryConfig
      grpc_service:
        envoy_grpc:
          cluster_name: opentelemetry_collector
      service_name: my-service

Envoy generates the x-request-id header for correlation and propagates traceparent / b3 headers to upstream services. Your application only needs to forward these headers — Envoy handles span creation.
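
The grpc_service in the tracing block references an opentelemetry_collector cluster that must be defined like any other, and because OTLP export runs over gRPC it needs HTTP/2. A sketch with a placeholder collector address (4317 is the conventional OTLP gRPC port):

```yaml
clusters:
  - name: opentelemetry_collector
    connect_timeout: 1s
    type: STRICT_DNS
    # gRPC export requires HTTP/2 to the collector
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: opentelemetry_collector
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: otel-collector
                    port_value: 4317
```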

Access Logging: Structured JSON access logs with all request metadata:

access_log:
  - name: envoy.access_loggers.stdout
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
      log_format:
        json_format:
          start_time: "%START_TIME%"
          method: "%REQ(:METHOD)%"
          path: "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"
          response_code: "%RESPONSE_CODE%"
          duration: "%DURATION%"
          upstream_cluster: "%UPSTREAM_CLUSTER%"
          bytes_sent: "%BYTES_SENT%"

TLS and mTLS with SDS

Manually distributing certificates across hundreds of services does not scale. Envoy’s Secret Discovery Service (SDS) solves this: certificates are fetched from a control plane at runtime and rotated without process restart.

For mutual TLS between services, configure a cluster’s transport_socket:

transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    common_tls_context:
      tls_certificate_sds_secret_configs:
        - name: client_cert
          sds_config:
            resource_api_version: V3
            ads: {}
      combined_validation_context:
        default_validation_context:
          match_subject_alt_names:
            - exact: "spiffe://cluster.local/ns/default/sa/backend"
        validation_context_sds_secret_config:
          name: validation_context
          sds_config:
            resource_api_version: V3
            ads: {}

The match_subject_alt_names field enforces SPIFFE identity — only connections from services with the expected SPIFFE URI are accepted. This is how Istio implements zero-trust networking: every pod-to-pod connection is mutually authenticated via short-lived certificates rotated by SPIRE.
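
The example above is the client (upstream) side. The server side mirrors it on the listener with a DownstreamTlsContext that demands a client certificate — a sketch reusing the same SDS secret names, with server_cert as a placeholder:

```yaml
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    # Reject any connection that does not present a valid client certificate
    require_client_certificate: true
    common_tls_context:
      tls_certificate_sds_secret_configs:
        - name: server_cert
          sds_config:
            resource_api_version: V3
            ads: {}
      validation_context_sds_secret_config:
        name: validation_context
        sds_config:
          resource_api_version: V3
          ads: {}
```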


Advanced Filters

Circuit Breaking: Prevents cascade failures by limiting pending requests, retries, and connections:

circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 1000
      max_pending_requests: 1000
      max_requests: 1000
      max_retries: 3
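
The fixed max_retries cap above can be replaced by a retry budget, which scales the retry allowance with active request volume instead of using an absolute number — a sketch:

```yaml
circuit_breakers:
  thresholds:
    - priority: DEFAULT
      retry_budget:
        # Allow retries up to 20% of active request volume...
        budget_percent:
          value: 20.0
        # ...but never below this concurrency floor
        min_retry_concurrency: 3
```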

Outlier Detection: Automatically ejects unhealthy hosts from the load balancing pool:

outlier_detection:
  consecutive_5xx: 5
  interval: 10s
  base_ejection_time: 30s
  max_ejection_percent: 10

After five consecutive 5xx responses, the endpoint is ejected for 30 seconds (the ejection time grows each time the same host is re-ejected). max_ejection_percent prevents ejecting all hosts when the upstream degrades globally.

Fault Injection: Inject latency or errors into a percentage of requests for chaos testing:

- name: envoy.filters.http.fault
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.fault.v3.HTTPFault
    delay:
      fixed_delay: 2s
      percentage:
        numerator: 10
        denominator: HUNDRED
    abort:
      http_status: 503
      percentage:
        numerator: 5
        denominator: HUNDRED

External Authorization: Delegate authorization decisions to an external gRPC service (e.g., OPA, Ory Keto). The ext_authz filter sends request headers to the authz service before forwarding upstream — enabling policy-as-code without application changes.
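
A minimal ext_authz entry for the http_filters chain — a sketch assuming a hypothetical authz_service cluster pointing at your gRPC authorization server:

```yaml
- name: envoy.filters.http.ext_authz
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
    transport_api_version: V3
    grpc_service:
      envoy_grpc:
        cluster_name: authz_service
      timeout: 0.25s
    # Fail closed: deny requests when the authz service is unreachable
    failure_mode_allow: false
```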

Wasm Extensions: Envoy supports WebAssembly filter plugins for custom business logic in any language that compiles to Wasm (Go, Rust, C++, AssemblyScript). Wasm filters are hot-reloaded via remote fetch without binary upgrades.
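
A sketch of loading a local Wasm module as an HTTP filter; the plugin name and filename are placeholders:

```yaml
- name: envoy.filters.http.wasm
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.wasm.v3.Wasm
    config:
      name: my_plugin
      vm_config:
        # Run the module in Envoy's built-in V8 Wasm runtime
        runtime: envoy.wasm.runtime.v8
        code:
          local:
            filename: /etc/envoy/filters/my_plugin.wasm
```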


Envoy vs Other Proxies

| Feature | Envoy | Nginx | HAProxy | Traefik | Linkerd | MOSN |
| --- | --- | --- | --- | --- | --- | --- |
| Dynamic config | xDS API (no reload) | nginx -s reload | Runtime API | Auto-discovers K8s | xDS (limited) | xDS API |
| Service mesh | Yes (Istio, Consul) | No | No | No (ingress only) | Yes (Linkerd2) | Yes (MOSN mesh) |
| L7 protocols | HTTP/1.1, HTTP/2, gRPC, Thrift, Kafka | HTTP/1.1, HTTP/2 | HTTP/1.1, HTTP/2 | HTTP/1.1, HTTP/2 | HTTP/1.1, HTTP/2, gRPC | HTTP/1.1, HTTP/2, Dubbo |
| Observability | Built-in stats + tracing | Module-based | Stats socket | Prometheus plugin | Built-in golden signals | Built-in stats |
| mTLS | SDS + SPIFFE | Manual certs | Manual certs | Manual certs | Automatic | SDS |
| Wasm filters | Yes | No | No | No | No | Yes |
| Config language | YAML/Protobuf | nginx.conf | haproxy.cfg | YAML/Labels | Linkerd CRDs | YAML/JSON |

Practical Example: Envoy as a Front Proxy

A real-world scenario: you have three microservices (users, orders, products) behind Envoy as an edge proxy, with traffic split between v1 and v2 of the orders service for canary deployment.

virtual_hosts:
  - name: microservices
    domains: ["api.example.com"]
    routes:
      - match:
          prefix: "/users"
        route:
          cluster: users_service
      - match:
          prefix: "/products"
        route:
          cluster: products_service
      - match:
          prefix: "/orders"
        route:
          weighted_clusters:
            clusters:
              - name: orders_v1
                weight: 90
              - name: orders_v2
                weight: 10
            total_weight: 100

This weighted cluster configuration sends 10% of /orders traffic to the v2 canary without any application code changes. Envoy’s stats will show per-cluster request rates — allowing you to compare error rates and latencies before shifting more traffic.
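
A common companion pattern lets testers opt in to v2 regardless of the weighted split: place a more specific route before the weighted one, since Envoy evaluates routes in order. A sketch using a hypothetical x-canary header:

```yaml
# Must appear BEFORE the weighted /orders route in the routes list
- match:
    prefix: "/orders"
    headers:
      - name: x-canary
        string_match:
          exact: "true"
  route:
    cluster: orders_v2
```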


Gotchas and Edge Cases

  • Header case sensitivity: HTTP/2 headers are lowercase by default. Envoy normalizes headers — ensure your upstream services handle lowercase content-type, authorization, etc.
  • Upstream timeouts vs route timeouts: Cluster connect_timeout (TCP) and route timeout (request) are independent. A missing route timeout defaults to 15 seconds — set it explicitly.
  • Retry budget: Unbounded retries under load can amplify failures. Cap them with num_retries, the max_retries circuit breaker, or a retry_budget threshold rather than relying on retry_on conditions alone, and always pair aggressive retry policies with a circuit breaker.
  • EDS vs STRICT_DNS: Use EDS for dynamic service discovery via a control plane. Use STRICT_DNS or LOGICAL_DNS for simpler setups where DNS resolves the upstream. STATIC is for fixed IP:port lists.
  • Wasm filter isolation: Each Wasm VM instance is isolated per worker thread, so plugin initialization runs once per thread. Shared state across workers requires external storage (Redis, etc.).

Troubleshooting

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| 503 upstream_reset_before_response_started | Upstream closed connection before responding | Check upstream health check path; increase connect_timeout |
| 404 from Envoy (not upstream) | No matching route | Run /config_dump on the admin port; check virtual host domain match |
| Circuit breaker open in stats | Upstream overwhelmed | Increase max_pending_requests or scale the upstream |
| mTLS handshake failure | Certificate SAN mismatch | Verify match_subject_alt_names matches the actual SPIFFE URI |
| High P99 latency | Worker thread starvation | Increase the worker thread count via the --concurrency command-line flag |
| xDS update not applied | Control plane version mismatch | Ensure the control plane uses xDS v3 protos; check Envoy version compatibility |

Summary

  • xDS API enables fully dynamic configuration without restarts — clusters, routes, listeners, and certificates all update live.
  • Load balancing offers six algorithms including Ring Hash for cache affinity and Least Request for heterogeneous workloads.
  • Built-in observability provides Prometheus stats, distributed tracing headers, and structured JSON access logs out of the box.
  • mTLS via SDS + SPIFFE delivers zero-trust networking with short-lived, automatically rotated certificates.
  • Advanced filters (circuit breaking, outlier detection, fault injection, ext_authz, Wasm) make Envoy extensible without touching application code.
  • Front proxy or sidecar — Envoy works standalone as an edge proxy or as the Istio/Consul data plane sidecar.