DuckDB: In-Process Analytics Database for Data Science and ETL

TL;DR — Quick Summary

DuckDB is an in-process SQL analytics database. Query CSV, Parquet, and JSON files directly with SQL — no server, no setup. The SQLite of analytics.

DuckDB can query CSV files with millions of rows, analyze Parquet datasets, and run complex aggregations, all from a single binary or a Python import. There is no database server to install, start, or configure.

Installation

# macOS
brew install duckdb

# Python
pip install duckdb

# Node.js
npm install duckdb

# CLI binary
# Download from https://duckdb.org/docs/installation/

# Start interactive CLI
duckdb

Querying Files Directly

-- Query CSV
SELECT * FROM 'sales.csv' LIMIT 10;

-- Query with aggregation
SELECT category, SUM(revenue), COUNT(*)
FROM 'sales.csv'
GROUP BY category
ORDER BY SUM(revenue) DESC;

-- Query Parquet (columnar, so typically much faster than CSV on large data)
SELECT * FROM 'data.parquet' WHERE date > '2025-01-01';

-- Glob patterns — query all files
SELECT * FROM 'logs/*.csv';
SELECT * FROM 'data/year=*/month=*/*.parquet';

-- Query JSON
SELECT * FROM read_json_auto('events.json');

-- Query remote files (HTTP(S) access relies on the httpfs extension)
SELECT * FROM 'https://example.com/data.csv';

Python Integration

import duckdb

# Query CSV directly
result = duckdb.sql("SELECT * FROM 'sales.csv'").df()

# Query pandas DataFrame
import pandas as pd
df = pd.read_csv('large_data.csv')
result = duckdb.sql("SELECT category, AVG(price) FROM df GROUP BY category").df()

# Persistent database
con = duckdb.connect('my_analytics.db')
con.sql("CREATE TABLE sales AS SELECT * FROM 'sales.csv'")
con.sql("SELECT * FROM sales WHERE revenue > 1000").show()

ETL Pipelines

-- CSV to Parquet (typically much smaller files and faster scans)
COPY (SELECT * FROM 'raw_data.csv') TO 'optimized.parquet' (FORMAT PARQUET);

-- Combine multiple CSVs into one Parquet
COPY (SELECT * FROM 'logs/*.csv') TO 'all_logs.parquet' (FORMAT PARQUET);

-- Export to JSON
COPY (SELECT * FROM analysis) TO 'results.json' (FORMAT JSON);

Comparison

Feature	DuckDB	SQLite	PostgreSQL	pandas
Type	OLAP	OLTP	OLTP/OLAP	Library
Server	No	No	Yes	No
Query CSV	Direct	No	COPY	read_csv
Query Parquet	Direct	No	Extension	read_parquet
Columnar	Yes	No	No	Yes
SQL Support	Full	Full	Full	Limited
Speed (analytics)	Fast	Slow	Fast	Moderate
Language bindings	Many	Many	Many	Python

Summary

DuckDB runs analytical SQL queries without a server and can be embedded in any application
Query CSV, Parquet, and JSON files directly with standard SQL (Excel files via an extension)
Columnar storage and vectorized execution for fast analytical queries
Python, R, Node.js, Java, and many more language bindings
Perfect for ETL pipelines, data exploration, and ad-hoc analytics

Guide & Instructions

Estimated Time: 10m

Tools Needed:

DuckDB CLI
Terminal

Install DuckDB

Install with brew install duckdb, pip install duckdb, or download from duckdb.org. No server setup needed.

Query a CSV file

Start the CLI with duckdb, then run SELECT * FROM 'data.csv' LIMIT 10; at the prompt. DuckDB auto-detects the column types as it reads the file.

Query Parquet files

Run SELECT count(*), avg(price) FROM 'sales/*.parquet' WHERE year = 2025; to aggregate across files. The glob pattern matches every Parquet file in the directory, and DuckDB queries them as a single table.

Export results

Run COPY (SELECT * FROM data) TO 'output.parquet' (FORMAT PARQUET); to write query results to a file. COPY can also export to CSV and JSON; Excel output requires an extension.

Use from Python

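import duckdb; result = duckdb.sql("SELECT * FROM 'data.csv'").df() returns a pandas DataFrame. The file name must be quoted inside the SQL string, and pandas must be installed for .df() to work.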

Frequently Asked Questions

What is DuckDB?

DuckDB is an in-process analytical SQL database management system. Often called 'the SQLite of analytics,' it runs inside your application with no separate server. It excels at analytical queries on CSV, Parquet, JSON, and other file formats.

How is DuckDB different from SQLite?

SQLite is optimized for transactional (OLTP) workloads — many small reads/writes. DuckDB is optimized for analytical (OLAP) workloads — scanning large amounts of data, aggregations, joins on big tables. DuckDB uses columnar storage and vectorized execution.

Can DuckDB query CSV and Parquet directly?

Yes. DuckDB can query CSV, Parquet, and JSON files directly without importing them first; Excel files are supported through an extension. Just use SELECT * FROM 'data.csv' or SELECT * FROM 'data.parquet' in your SQL.

Is DuckDB free?

Yes, DuckDB is free and open source (MIT license). It has bindings for Python, R, Node.js, Java, C++, Rust, Go, and more.