
Apache Parquet

Key idea:

Apache Parquet is a columnar storage format developed by Twitter and Cloudera (2013), an ASF top-level project since 2015, and the default format for analytics on a data lake. Benefits: files typically 5-20x smaller than the equivalent CSV, column pruning (read only the columns you need), predicate pushdown (filters applied by the reader). Write from Spark, pandas, or DuckDB; read anywhere.

Below: details, example, FAQ.


Details

  • Columnar: data stored by columns, not rows
  • Compression: Snappy, Gzip, ZSTD, LZ4 per column
  • Encoding: dictionary, RLE, bit-packing, delta for efficient storage
  • Predicate pushdown: a filter like WHERE date > '2026-01-01' is applied by the reader, which uses stored min/max statistics to skip whole row groups instead of scanning everything (see the sketch after this list)
  • Schema embedded in file → self-describing
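
A minimal pyarrow sketch of the last two points, assuming the sales.parquet file from the example below: the schema travels inside the file, and each row group carries per-column min/max statistics that the reader consults before deciding whether it can be skipped.

# Python (pyarrow): inspect the embedded schema and row-group statistics
import pyarrow.parquet as pq

pf = pq.ParquetFile('sales.parquet')        # assumed local file
print(pf.schema_arrow)                      # schema is embedded, no sidecar needed
meta = pf.metadata
print(meta.num_row_groups, meta.num_rows)

# Per-column min/max per row group: what predicate pushdown checks
# in order to skip row groups without reading them
stats = meta.row_group(0).column(0).statistics
print(stats.min, stats.max)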

Example

# Python (pandas)
import pandas as pd
df = pd.read_csv('sales.csv')
df.to_parquet('sales.parquet', compression='snappy')

# Read only needed columns (column pruning)
df = pd.read_parquet('sales.parquet', columns=['date', 'amount'])

# DuckDB: query Parquet directly (pruning and pushdown handled by the reader)
import duckdb
duckdb.sql("SELECT SUM(amount) FROM 'sales.parquet' WHERE date > '2026-01-01'").show()
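
Column pruning and predicate pushdown can also be driven from Python via pyarrow; a rough sketch, assuming date was written as a plain string column (as pd.read_csv would produce):

# Python (pyarrow): push the column list and the filter down to the reader
import pyarrow.parquet as pq

table = pq.read_table(
    'sales.parquet',
    columns=['date', 'amount'],               # column pruning
    filters=[('date', '>', '2026-01-01')],    # predicate pushdown (string compare)
)
df = table.to_pandas()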

Frequently Asked Questions

Parquet vs CSV?

CSV: human-readable, row-oriented, no schema, no compression. Parquet: binary, columnar, embedded schema, typically 5-20x smaller. For analytics, use Parquet.
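
To check the ratio on your own data, a quick sketch (file names are placeholders):

# Python: measure the same data as CSV vs Parquet on disk
import os
import pandas as pd

df = pd.read_csv('sales.csv')
df.to_parquet('sales.parquet', compression='snappy')

csv_mb = os.path.getsize('sales.csv') / 1e6
pq_mb = os.path.getsize('sales.parquet') / 1e6
print(f'CSV {csv_mb:.1f} MB vs Parquet {pq_mb:.1f} MB ({csv_mb / pq_mb:.1f}x)')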

Parquet vs ORC?

Similar columnar formats. ORC: roots in the Hive ecosystem, richer built-in indexes. Parquet: broader ecosystem adoption (Spark's default). As of 2026, Parquet is the safer default choice.

Typical size?

CSV 1 GB → Parquet 50-200 MB with Snappy compression (5-20x). ZSTD compresses further still, at higher CPU cost.
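
A quick way to compare codecs on your own data (assumes the pyarrow engine, which supports both snappy and zstd; file names are placeholders):

# Python (pandas): same DataFrame, two codecs, compare file sizes
import os
import pandas as pd

df = pd.read_csv('sales.csv')
df.to_parquet('sales_snappy.parquet', compression='snappy')
df.to_parquet('sales_zstd.parquet', compression='zstd')

for path in ('sales_snappy.parquet', 'sales_zstd.parquet'):
    print(path, round(os.path.getsize(path) / 1e6, 1), 'MB')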