Paperless-ngx: Self-Hosted Document Management with OCR and Auto-Tagging

Why Paperless-ngx?

Stacks of paper, scattered PDFs, endless email attachments — Paperless-ngx tames the chaos:

OCR everything — Extract text from scans, photos, and PDFs.
Auto-classify — Rules assign tags, correspondents, and types automatically.
Full-text search — Find any document by its content in seconds.
Email ingestion — Automatically import email attachments.
Web UI — Modern, responsive dashboard with previews.

Prerequisites

Docker with the Compose plugin (the `docker compose` command used below).
At least 1 GB RAM (Tesseract OCR is RAM-hungry).
Storage space for your documents.

Step 1: Deploy with Docker Compose

# docker-compose.yml
version: "3"
services:
  paperless-redis:
    image: redis:7
    restart: always

  paperless-db:
    image: postgres:16
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: changeme
    volumes:
      - pgdata:/var/lib/postgresql/data
    restart: always

  paperless:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    depends_on:
      - paperless-redis
      - paperless-db
    ports:
      - "8000:8000"
    environment:
      PAPERLESS_REDIS: redis://paperless-redis:6379
      PAPERLESS_DBHOST: paperless-db
      PAPERLESS_DBPASS: changeme  # must match POSTGRES_PASSWORD above
      PAPERLESS_ADMIN_USER: admin
      PAPERLESS_ADMIN_PASSWORD: changeme
      PAPERLESS_OCR_LANGUAGE: eng+spa
    volumes:
      - ./data:/usr/src/paperless/data
      - ./media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume
    restart: always

volumes:
  pgdata:

docker compose up -d

Once the containers are up, open http://your-server:8000 and log in with the admin credentials set in the compose file.
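Both `changeme` values in the compose file are placeholders and should be replaced before the instance is reachable by anyone else. One possible way to generate replacements is Python's `secrets` module. This is a minimal sketch; the helper name and the 32-byte length are arbitrary choices, not anything Paperless requires.

```python
import secrets


def generate_password(num_bytes: int = 32) -> str:
    """Return a random URL-safe string suitable for use as a password."""
    # token_urlsafe only emits letters, digits, "-" and "_".
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Paste the values into the compose file in place of "changeme".
    print("POSTGRES_PASSWORD:", generate_password())
    print("PAPERLESS_ADMIN_PASSWORD:", generate_password())
```

Quoting the generated values in the YAML file avoids surprises if one happens to start with a dash.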

Step 2: Consumption Workflow

Drop files into the consume/ directory:

cp invoice-2026-02.pdf /path/to/consume/

Paperless automatically:

Detects the new file.
Runs OCR (Tesseract) to extract text.
Applies matching rules to assign tags/correspondent/type.
Stores the original and creates a searchable archive version.
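When files are dropped in by a script rather than by hand, it can help to make the file appear in consume/ in a single step, so the watcher is less likely to see a half-written copy. The sketch below stages the copy in a separate directory and then renames it into place. The function name and the staging directory are assumptions for illustration, and the staging directory must sit on the same filesystem as consume/ for the rename to be atomic.

```python
import os
import shutil
import tempfile
from pathlib import Path


def drop_into_consume(source: Path, consume_dir: Path, staging_dir: Path) -> Path:
    """Copy `source` into `consume_dir` via a staged copy and an atomic rename."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    # Copy to a uniquely named temp file outside the watched directory first.
    fd, tmp_name = tempfile.mkstemp(dir=staging_dir, suffix=source.suffix)
    os.close(fd)
    shutil.copyfile(source, tmp_name)
    # os.replace is atomic when both paths are on the same filesystem.
    target = consume_dir / source.name
    os.replace(tmp_name, target)
    return target
```

A call such as `drop_into_consume(Path("invoice-2026-02.pdf"), Path("/path/to/consume"), Path("/path/to/staging"))` then has the same effect as the `cp` command above.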

Step 3: Auto-Tagging Rules

Rule Type	Example	Use Case
Tag	Contains “electricity” → tag “Utilities”	Categorize by topic
Correspondent	Contains “Telmex” → correspondent “Telmex”	Identify sender
Document Type	Contains “factura” → type “Invoice”	Classify document kind
Storage Path	Year/Correspondent/	Organize filesystem
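To make the table concrete, here is a rough illustration of how content-based rules of this kind behave. It is not Paperless's own matching code. The `matches` and `classify` helpers, the `RULES` list, and the algorithm names are all hypothetical, and only a few simple matching styles are covered.

```python
import re


def matches(text: str, pattern: str, algorithm: str = "any") -> bool:
    """Return True if `text` satisfies `pattern` under the given algorithm."""
    text_l = text.lower()
    words = pattern.lower().split()
    if algorithm == "any":    # at least one word of the pattern appears
        return any(re.search(rf"\b{re.escape(w)}\b", text_l) for w in words)
    if algorithm == "all":    # every word of the pattern appears
        return all(re.search(rf"\b{re.escape(w)}\b", text_l) for w in words)
    if algorithm == "exact":  # the whole pattern appears as one phrase
        return re.search(rf"\b{re.escape(pattern.lower())}\b", text_l) is not None
    if algorithm == "regex":  # the pattern is a regular expression
        return re.search(pattern, text, flags=re.IGNORECASE) is not None
    raise ValueError(f"unknown algorithm: {algorithm}")


# Hypothetical rules mirroring the table: (field, value, pattern, algorithm)
RULES = [
    ("tag", "Utilities", "electricity", "any"),
    ("correspondent", "Telmex", "Telmex", "exact"),
    ("document_type", "Invoice", r"\bfacturas?\b", "regex"),
]


def classify(text: str) -> dict[str, str]:
    """Collect the metadata assigned by every rule that matches `text`."""
    return {
        field: value
        for field, value, pattern, algorithm in RULES
        if matches(text, pattern, algorithm)
    }
```

For example, `classify("Telmex factura: servicio de internet")` yields a correspondent and a document type but no tag, because the word "electricity" does not appear in the text.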

Step 4: Email Ingestion

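Mail accounts for ingestion are set up in the web UI (Mail section), not through environment variables; the PAPERLESS_EMAIL_* variables configure outgoing SMTP mail instead. Add a mail account with values such as:

IMAP server: imap.gmail.com
IMAP port: 993
IMAP security: SSL
Username: docs@example.com
Password: app-password

Then add a mail rule for that account that names the folder to watch (for example INBOX) and tells Paperless to consume the attachments.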

Paperless checks for new emails with attachments and ingests them automatically.
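Conceptually, ingesting a message comes down to pulling out the attachments with a supported file type and handing them to the consumer. The sketch below shows that step with Python's standard `email` package. It is an illustration rather than Paperless's implementation, and the `CONSUMABLE` set and the function name are assumptions.

```python
from email.message import EmailMessage
from pathlib import Path

# Assumed subset of extensions worth importing; adjust to your setup.
CONSUMABLE = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}


def extract_attachments(msg: EmailMessage) -> list[tuple[str, bytes]]:
    """Return (filename, payload) for each attachment with a consumable extension."""
    found = []
    for part in msg.iter_attachments():
        name = part.get_filename() or ""
        if Path(name).suffix.lower() in CONSUMABLE:
            # decode=True reverses the transfer encoding (e.g. base64).
            found.append((name, part.get_payload(decode=True)))
    return found
```

A raw message fetched over IMAP can be turned into an `EmailMessage` with `email.message_from_bytes(raw, policy=email.policy.default)` before passing it to this function.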

Troubleshooting

Problem	Solution
OCR produces garbage text	Set `PAPERLESS_OCR_LANGUAGE` to the languages that actually appear in your documents, e.g. `eng+spa+deu`; languages not bundled with the image must also be listed in `PAPERLESS_OCR_LANGUAGES` so their Tesseract data is installed
Document stuck in “Processing”	Check container logs: `docker compose logs paperless`; a common cause is OCR failing on a corrupt file
Duplicate documents detected	Paperless rejects files whose checksum matches a document it already holds; this is expected behavior
Search returns no results	Rebuild the search index: `docker compose exec paperless document_index reindex`
Email ingestion not working	Test the IMAP credentials manually; for Gmail, use an app-specific password, since the old “Less secure apps” setting is no longer offered
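The duplicate check in the table can be pictured as comparing file hashes. The following sketch groups files by digest to show which ones a hash-based check would treat as duplicates. SHA-256 and the function names are choices made for this example, not a statement about which hash Paperless uses internally.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large scans need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: list[Path]) -> dict[str, list[Path]]:
    """Group paths by checksum and keep only groups with more than one file."""
    groups: dict[str, list[Path]] = {}
    for path in paths:
        groups.setdefault(file_checksum(path), []).append(path)
    return {digest: files for digest, files in groups.items() if len(files) > 1}
```

A check like this only catches byte-identical files. Scanning the same sheet of paper twice produces two different files, so both would be accepted.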

Summary

Drop files into a folder — Paperless handles OCR and classification.
Matching rules auto-tag documents by content, saving manual work.
Full-text search finds any document in seconds.
Email ingestion automates document input.

Frequently Asked Questions

What is Paperless-ngx and why use it?

Paperless-ngx is a self-hosted document management system that ingests, OCRs, and organizes your documents. Drop a PDF, image, or scan into the consumption folder, and Paperless automatically extracts the text via OCR (Tesseract), matches it against your rules to assign tags, correspondents, and document types, and makes everything full-text searchable. It can replace filing cabinets and paid services like DocuWare.

What file types does Paperless-ngx support?

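PDF (native and scanned), PNG, JPEG, TIFF, GIF, and WebP work out of the box. Office documents (DOCX, XLSX) are supported only when the optional Apache Tika and Gotenberg services are enabled, which the compose file in Step 1 does not include. PDFs with embedded text are indexed directly; scanned PDFs and images go through Tesseract OCR, which offers language data for more than 100 languages.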

How does automatic tagging work?

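Each tag, correspondent, and document type has its own matching pattern and algorithm. When a new document's text satisfies the pattern (e.g., 'electricity bill'), Paperless assigns that item: a tag such as 'Utilities', a correspondent such as 'CFE', or a document type such as 'Invoice'. Patterns can match any or all of a list of words, exact text, a regular expression, or a fuzzy approximation; an Auto mode instead learns from documents you have already classified.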

Can Paperless-ngx ingest documents from email?

Yes. Configure an IMAP email account in Paperless, and it will periodically check for new emails with attachments. Documents are downloaded, OCR'd, and added to the system automatically.