TL;DR — Quick Summary
Complete ClickHouse guide: columnar storage, MergeTree engines, replication, materialized views, data ingestion, and a practical log analytics example.
ClickHouse is an open-source column-oriented DBMS built for real-time OLAP queries at scale. Developed at Yandex and open-sourced in 2016, it processes billions of rows per second on commodity hardware by combining columnar storage, vectorized query execution, and aggressive data compression. This guide covers everything from architecture internals and installation to replication, materialized views, and a practical log analytics walkthrough.
Prerequisites
- Linux host or Docker with at least 4 GB RAM (16 GB+ recommended for production).
- Basic SQL knowledge.
- For replication: ZooKeeper 3.6+ or ClickHouse Keeper (bundled since v22.4).
ClickHouse Architecture
Columnar Storage
Traditional row-oriented databases store all columns of a row together on disk. ClickHouse stores each column in a separate file. When a query touches only 3 of 100 columns, ClickHouse reads roughly 3% of the data. This column-affinity also enables far better compression ratios because repeated similar values pack tightly.
Vectorized Query Execution
ClickHouse processes data in blocks of 8,192–65,536 rows at a time rather than row by row. SIMD CPU instructions operate on entire vectors of values simultaneously. Combined with query compilation via LLVM, aggregations that would take minutes on a row store finish in seconds.
MergeTree Engine Family
The MergeTree family is ClickHouse’s workhorse. Data arrives as parts (sorted, immutable on-disk chunks) and is periodically merged in the background:
| Engine | Best For |
|---|---|
| MergeTree | Append-only analytics, base engine |
| ReplacingMergeTree | Last-write-wins deduplication |
| SummingMergeTree | Pre-aggregated counters/sums |
| AggregatingMergeTree | Custom aggregate states (used with materialized views) |
| CollapsingMergeTree | Mutable rows via sign column |
| ReplicatedMergeTree | HA replication via Keeper/ZooKeeper |
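The dedup variant from the table above can be sketched as follows — a minimal, hypothetical `users` table where rows sharing the `ORDER BY` key are collapsed during background merges, keeping the row with the highest version column:

```sql
-- Illustrative table; names are assumptions, not from the original text.
CREATE TABLE users
(
    user_id    UInt64,
    email      String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)  -- version column: highest wins
ORDER BY user_id;

-- Merges are asynchronous, so reads may still see duplicates;
-- FINAL forces deduplication at query time (at a read cost).
SELECT * FROM users FINAL WHERE user_id = 42;
```

Because deduplication only happens at merge time, treat ReplacingMergeTree as eventually-deduplicated rather than unique-constrained.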
Data Compression
ClickHouse defaults to LZ4 (fast decompression) and supports ZSTD (higher ratio). Per-column codecs stack: CODEC(Delta, LZ4) can shrink timestamp columns by 90% or more. Typical compression ratios are 5–15× over raw data.
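A sketch of stacked per-column codecs (table and column names are illustrative):

```sql
CREATE TABLE metrics
(
    -- Delta stores differences between consecutive values, which
    -- compresses near-monotonic timestamps extremely well.
    ts    DateTime CODEC(Delta, LZ4),
    -- DoubleDelta suits slowly-changing counters; ZSTD trades CPU for ratio.
    hits  UInt64   CODEC(DoubleDelta, ZSTD(3)),
    -- Gorilla is designed for float time-series values.
    value Float64  CODEC(Gorilla, LZ4)
)
ENGINE = MergeTree()
ORDER BY ts;
```

Codecs apply left to right at write time, so the specialized encoding runs before the general-purpose compressor.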
Approximate Query Processing
For interactive dashboards, ClickHouse supports sampling on MergeTree tables declared with a SAMPLE BY clause. Adding SAMPLE 0.01 to a query then processes roughly 1% of the data while returning statistically representative results in milliseconds.
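A minimal sketch, assuming a hypothetical `hits` table — the sampling expression must be part of the table's primary key:

```sql
CREATE TABLE hits
(
    event_time DateTime,
    user_id    UInt64,
    url        String
)
ENGINE = MergeTree()
ORDER BY (event_time, intHash32(user_id))  -- sampling key must appear here
SAMPLE BY intHash32(user_id);

-- Reads ~1% of rows; scale counts back up to estimate totals.
SELECT count() * 100 AS est_total FROM hits SAMPLE 0.01;
```

Hashing the user ID gives a uniformly distributed sampling key, so any 1% slice is representative of the whole dataset.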
Installation
apt (Ubuntu/Debian)
apt-get install -y apt-transport-https ca-certificates curl gnupg
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' \
| gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] \
https://packages.clickhouse.com/deb stable main" \
> /etc/apt/sources.list.d/clickhouse.list
apt-get update
apt-get install -y clickhouse-server clickhouse-client
systemctl enable --now clickhouse-server
yum/dnf (RHEL/Rocky/AlmaLinux)
yum install -y yum-utils
yum-config-manager --add-repo \
https://packages.clickhouse.com/rpm/clickhouse.repo
yum install -y clickhouse-server clickhouse-client
systemctl enable --now clickhouse-server
Docker
docker run -d \
--name clickhouse \
--ulimit nofile=262144:262144 \
-p 8123:8123 -p 9000:9000 \
-v clickhouse-data:/var/lib/clickhouse \
clickhouse/clickhouse-server:latest
Port 8123 = HTTP interface. Port 9000 = native TCP (used by clickhouse-client).
Initial Configuration
config.xml — Network and Storage
Edit /etc/clickhouse-server/config.xml:
<!-- Accept connections from all interfaces -->
<listen_host>0.0.0.0</listen_host>
<!-- Data and logs storage paths -->
<path>/var/lib/clickhouse/</path>
<tmp_path>/var/lib/clickhouse/tmp/</tmp_path>
<user_files_path>/var/lib/clickhouse/user_files/</user_files_path>
users.xml — Profiles and Quotas
Edit /etc/clickhouse-server/users.xml:
<profiles>
<default>
<max_memory_usage>10000000000</max_memory_usage> <!-- 10 GB -->
<max_threads>8</max_threads>
<use_uncompressed_cache>0</use_uncompressed_cache>
</default>
<readonly>
<readonly>1</readonly>
</readonly>
</profiles>
<quotas>
<default>
<interval>
<duration>3600</duration>
<queries>100</queries>
<errors>10</errors>
</interval>
</default>
</quotas>
Set a strong password for the default user or create a dedicated user per application.
Table Design and Schema
Data Types
| Concern | Recommended Type |
|---|---|
| String columns with < 10k distinct values | LowCardinality(String) |
| Nullable fields | Avoid; use sentinel values ('', 0) |
| High-precision timestamps | DateTime64(3) (milliseconds) |
| IPv4/IPv6 | IPv4, IPv6 native types |
| JSON blobs | String + JSON functions or Tuple |
LowCardinality wraps a dictionary encoding around any type, cutting memory and improving compression for repeated values like HTTP status codes, country codes, or log levels.
Partitioning and Ordering
CREATE TABLE events
(
event_date Date,
event_time DateTime64(3),
user_id UInt64,
action LowCardinality(String),
revenue Decimal(18, 4),
properties String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (action, user_id, event_time)
SETTINGS index_granularity = 8192;
- `PARTITION BY` controls physical data separation. Monthly partitions (`toYYYYMM`) allow instant partition drops for data retention.
- `ORDER BY` determines the sparse primary index and the physical sort order. Order columns from low to high cardinality (as in `(action, user_id, event_time)` above) so the index prunes effectively and similar values sort together for compression.
- Primary key vs sorting key: the primary key defaults to the ORDER BY key but can be a prefix of it. Use `PRIMARY KEY (action)` to keep a shorter in-memory index while still sorting by `(action, user_id, event_time)`.
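A sketch of the prefix form on a table like the one above (the `events_pk` name is illustrative):

```sql
-- PRIMARY KEY must be a prefix of ORDER BY: the sparse index holds
-- only `action` marks, while data stays sorted by all three columns.
CREATE TABLE events_pk
(
    event_date Date,
    event_time DateTime64(3),
    user_id    UInt64,
    action     LowCardinality(String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
PRIMARY KEY (action)
ORDER BY (action, user_id, event_time);
```

The shorter index cuts memory use; the trade-off is coarser pruning for queries that filter on `user_id` or `event_time` without `action`.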
Data Ingestion
clickhouse-client
# Interactive
clickhouse-client --host=localhost --user=default --password=secret
# Batch insert from CSV
clickhouse-client --query="INSERT INTO events FORMAT CSVWithNames" < events.csv
HTTP Interface
curl -X POST 'http://localhost:8123/?query=INSERT+INTO+events+FORMAT+JSONEachRow' \
--data-binary '{"event_date":"2026-03-23","user_id":1,"action":"click",...}'
Kafka Engine
The Kafka engine reads directly from Kafka topics:
CREATE TABLE events_kafka
(
event_time DateTime64(3),
user_id UInt64,
action LowCardinality(String)
)
ENGINE = Kafka
SETTINGS
kafka_broker_list = 'broker:9092',
kafka_topic_list = 'events',
kafka_group_name = 'clickhouse-consumer',
kafka_format = 'JSONEachRow';
-- Materialized view to push from Kafka engine to MergeTree
CREATE MATERIALIZED VIEW events_mv TO events AS
SELECT * FROM events_kafka;
S3 Engine
SELECT count(), sum(revenue)
FROM s3(
'https://s3.us-east-1.amazonaws.com/bucket/events/2026-03-*.parquet',
'ACCESS_KEY', 'SECRET_KEY',
'Parquet'
);
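The same table function works for batch loading, not just ad-hoc queries. A hedged sketch — bucket path and credentials are placeholders:

```sql
-- One-shot backfill from Parquet files in S3 into the local MergeTree table.
INSERT INTO events
SELECT *
FROM s3(
    'https://s3.us-east-1.amazonaws.com/bucket/events/2026-03-*.parquet',
    'ACCESS_KEY', 'SECRET_KEY',
    'Parquet'
);
```

Column names in the Parquet files must line up with the target table's schema (or be selected explicitly).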
clickhouse-local
Process files without a running server:
clickhouse-local --query="
SELECT action, count() as cnt
FROM file('events.parquet', Parquet)
GROUP BY action ORDER BY cnt DESC LIMIT 10
"
SQL Dialect and Extensions
ClickHouse speaks a largely ANSI-compatible SQL dialect with powerful extensions:
-- arrayJoin: explode array column into rows
SELECT user_id, arrayJoin(tags) AS tag
FROM posts;
-- SAMPLE: approximate query on 1% of data
SELECT count() FROM events SAMPLE 0.01;
-- FINAL: force deduplication read for ReplacingMergeTree
SELECT * FROM users FINAL WHERE user_id = 42;
-- WITH ROLLUP: multi-level subtotals
SELECT action, toDate(event_time) AS day, sum(revenue)
FROM events
GROUP BY action, day WITH ROLLUP;
-- Window functions
SELECT user_id,
revenue,
sum(revenue) OVER (PARTITION BY user_id ORDER BY event_time) AS running_total
FROM events;
-- Lambda functions
SELECT arrayFilter(x -> x > 100, [50, 150, 200]) AS big_values;
Dictionaries for Lookups
The HASHED layout requires a numeric key; longest-prefix IP lookups use the IP_TRIE layout instead, and the HTTP source needs a format:
CREATE DICTIONARY geo_country (
    ip_prefix String,
    country String
)
PRIMARY KEY ip_prefix
SOURCE(HTTP(URL 'https://...' FORMAT 'CSV'))
LAYOUT(IP_TRIE())
LIFETIME(3600);
-- client_ip here is assumed to be a String column; ip_trie matches the longest prefix
SELECT dictGet('geo_country', 'country', tuple(IPv4StringToNum(client_ip))) AS country, count()
FROM access_log GROUP BY country;
Replication and Sharding
ReplicatedMergeTree
Replace MergeTree() with ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}'). ClickHouse Keeper (or ZooKeeper) coordinates replicas.
CREATE TABLE events ON CLUSTER my_cluster
(
event_date Date,
event_time DateTime64(3),
user_id UInt64,
action LowCardinality(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (action, user_id, event_time);
Distributed Table Engine
A Distributed table is a virtual layer that routes queries across shards:
CREATE TABLE events_dist ON CLUSTER my_cluster
AS events
ENGINE = Distributed(my_cluster, default, events, rand());
Writes to events_dist are spread across shards by the sharding key (rand() here); reads fan out to all shards and the partial results are merged automatically.
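The my_cluster name referenced by ON CLUSTER and the Distributed engine must be defined under remote_servers in config.xml. A minimal two-shard, two-replica sketch — hostnames are placeholders:

```xml
<remote_servers>
    <my_cluster>
        <shard>
            <replica><host>ch-1a</host><port>9000</port></replica>
            <replica><host>ch-1b</host><port>9000</port></replica>
        </shard>
        <shard>
            <replica><host>ch-2a</host><port>9000</port></replica>
            <replica><host>ch-2b</host><port>9000</port></replica>
        </shard>
    </my_cluster>
</remote_servers>
```

The {shard} and {replica} macros used in the ReplicatedMergeTree path are set per node in the macros section of the same file.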
Materialized Views
Materialized views in ClickHouse fire on every INSERT into the source table — effectively creating real-time aggregation pipelines:
-- Target: AggregatingMergeTree stores intermediate states
CREATE TABLE events_hourly_agg
(
action LowCardinality(String),
hour DateTime,
cnt AggregateFunction(count),
total_rev AggregateFunction(sum, Decimal(18,4))
)
ENGINE = AggregatingMergeTree()
ORDER BY (action, hour);
-- View triggers on INSERT to events
CREATE MATERIALIZED VIEW events_hourly_mv
TO events_hourly_agg AS
SELECT
action,
toStartOfHour(event_time) AS hour,
countState() AS cnt,
sumState(revenue) AS total_rev
FROM events
GROUP BY action, hour;
-- Query merges states on read
SELECT action, hour,
countMerge(cnt) AS requests,
sumMerge(total_rev) AS revenue
FROM events_hourly_agg
GROUP BY action, hour
ORDER BY hour DESC;
Refreshable Materialized Views (v23.12+) allow periodic full recalculations — useful for complex transformations that don’t compose incrementally:
CREATE MATERIALIZED VIEW monthly_cohorts
REFRESH EVERY 1 HOUR
ENGINE = MergeTree() ORDER BY tuple()
AS SELECT ...
FROM events
WHERE event_date >= today() - 90;
Monitoring
System Tables
-- Slow queries in the last hour
SELECT query, query_duration_ms, read_rows, memory_usage
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
AND query_duration_ms > 5000
ORDER BY query_duration_ms DESC LIMIT 20;
-- Current server metrics
SELECT metric, value FROM system.metrics
WHERE metric IN ('Query', 'Connection', 'MemoryTracking');
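Part counts, on-disk size, and achieved compression per table come from system.parts:

```sql
-- Disk usage, compression ratio, and active part count per table.
SELECT
    database,
    table,
    count() AS parts,
    formatReadableSize(sum(bytes_on_disk)) AS on_disk,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 1) AS ratio
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;
```

A steadily growing part count for one table is an early warning sign of too-small inserts or over-fine partitioning.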
Prometheus Endpoint
Enable in config.xml:
<prometheus>
<endpoint>/metrics</endpoint>
<port>9363</port>
<metrics>true</metrics>
<events>true</events>
<asynchronous_metrics>true</asynchronous_metrics>
</prometheus>
Scrape http://clickhouse-host:9363/metrics with Prometheus. The official Grafana ClickHouse plugin (grafana-clickhouse-datasource) connects directly via the native protocol for live dashboard queries.
ClickHouse vs Alternatives
| | ClickHouse | TimescaleDB | Apache Druid | DuckDB | BigQuery | Elasticsearch |
|---|---|---|---|---|---|---|
| Storage | Columnar | Row + compression | Columnar segments | Columnar | Columnar | Inverted index |
| Query speed | Very fast | Fast | Fast | Very fast | Fast | Medium |
| Ingestion rate | Very high | High | High | Low | High | High |
| Streaming | Kafka engine | Continuous aggs | Native | No | Pub/Sub | Logstash |
| Self-hosted | Yes | Yes | Yes | Yes | No | Yes |
| Best for | OLAP/logs/events | Time-series + SQL | Event analytics | Local files/BI | Serverless DW | Full-text search |
| Scale model | Sharding + replication | Hypertable | Cluster | Single node | Serverless | Cluster |
Practical Example: Log Analytics
Real-world scenario: you receive 500 million nginx access log lines per day and need sub-second dashboards.
-- Schema
CREATE TABLE nginx_logs
(
log_date Date,
log_time DateTime64(3),
remote_addr IPv4,
method LowCardinality(String),
uri String,
status UInt16,
bytes_sent UInt64,
request_time Float32,
referer String,
user_agent String
)
ENGINE = MergeTree()
PARTITION BY log_date
ORDER BY (status, log_time)
TTL log_date + INTERVAL 90 DAY DELETE
SETTINGS index_granularity = 8192;
-- Real-time P95 latency per endpoint (materialized view)
CREATE TABLE nginx_latency_agg
(
uri String,
hour DateTime,
p95_state AggregateFunction(quantile(0.95), Float32)
)
ENGINE = AggregatingMergeTree()
ORDER BY (uri, hour);
CREATE MATERIALIZED VIEW nginx_latency_mv TO nginx_latency_agg AS
SELECT
uri,
toStartOfHour(log_time) AS hour,
quantileState(0.95)(request_time) AS p95_state
FROM nginx_logs
GROUP BY uri, hour;
-- Query: top 10 slowest endpoints today
SELECT uri,
       quantileMerge(0.95)(p95_state) AS p95_sec  -- nginx request_time is in seconds
FROM nginx_latency_agg
WHERE hour >= toStartOfDay(now())
GROUP BY uri
ORDER BY p95_sec DESC LIMIT 10;
Ingest by piping tail -F /var/log/nginx/access.log into clickhouse-client, or use the Kafka engine for distributed collection.
Gotchas and Edge Cases
- FINAL is expensive — `SELECT ... FINAL` on large ReplacingMergeTree tables forces a full merge pass at read time. Prefer `optimize_on_insert = 1` or schedule `OPTIMIZE TABLE ... FINAL` off-peak.
- Small inserts kill performance — never insert row-by-row. Batch at least 10,000 rows per INSERT or use the async insert mode (`async_insert = 1`).
- JOIN order matters — always put the smaller table on the right side of a JOIN. ClickHouse loads the right-side table into memory as a hash table.
- Nullable overhead — every `Nullable(T)` column adds a separate null-map bitfield. Avoid unless nulls are semantically required.
- Partition granularity — partitioning by day on a high-volume table creates too many small parts. Prefer monthly partitions and rely on ORDER BY for pruning.
Summary
- Columnar storage plus vectorized execution delivers 100–1000× faster aggregations than OLTP databases for analytical workloads.
- The MergeTree family covers dedup (Replacing), pre-aggregation (Summing, Aggregating), and mutability (Collapsing).
- Schema design is the most impactful tuning knob: `ORDER BY` for query patterns, `PARTITION BY` for retention, `LowCardinality` for enums.
- Materialized views plus `AggregatingMergeTree` create real-time aggregation pipelines that update on every insert.
- Kafka and S3 engines eliminate external ETL for streaming and batch ingestion.
- Replication via ReplicatedMergeTree plus ClickHouse Keeper provides HA without additional infrastructure.