Apache Parquet

Key idea:

Apache Parquet is a columnar storage format created by Twitter and Cloudera (first released in 2013) and now a top-level Apache Software Foundation project. It is the de facto default format for analytics on a data lake. Benefits: files often 5-20x smaller than CSV, column pruning (read only the columns a query needs), and predicate pushdown (filters applied inside the reader). Write from Spark, pandas, or DuckDB and read with any Parquet-aware tool.

Below: details, example, related terms, FAQ.

Details

  • Columnar: data stored by columns, not rows
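  • Compression: Snappy, Gzip, ZSTD, or LZ4, configurable per column
  • Encoding: dictionary, RLE, bit-packing, and delta encodings for efficient storage
  • Predicate pushdown: a filter such as WHERE date > '2026-01-01' is checked against per-row-group min/max statistics in the reader, so row groups that cannot match are skipped instead of scanned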
  • Schema embedded in file → self-describing
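To make the dictionary and RLE encodings listed above concrete, here is a toy sketch in plain Python. It is not Parquet's actual on-disk implementation; the `country` column and both helper functions are hypothetical and only illustrate why a column of repeated values shrinks so well.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []
    index_of = {}
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    return dictionary, indices


def run_length_encode(indices):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(indices)]


# Hypothetical low-cardinality column, as often found in analytics data
country = ["DE", "DE", "DE", "FR", "FR", "DE", "US", "US", "US", "US"]

dictionary, indices = dictionary_encode(country)
runs = run_length_encode(indices)

print(dictionary)  # ['DE', 'FR', 'US']
print(indices)     # [0, 0, 0, 1, 1, 0, 2, 2, 2, 2]
print(runs)        # [(0, 3), (1, 2), (0, 1), (2, 4)]
```

Ten strings become a three-entry dictionary plus four short runs. Because a columnar layout keeps similar values next to each other, runs like these are common, which is what makes these encodings effective.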

Example

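# Python (pandas); to_parquet needs pyarrow or fastparquet installed
import duckdb
import pandas as pd

# Parse dates on load so the column is stored as a timestamp, not a string
df = pd.read_csv('sales.csv', parse_dates=['date'])
df.to_parquet('sales.parquet', compression='snappy')

# Read only needed columns (column pruning)
df = pd.read_parquet('sales.parquet', columns=['date', 'amount'])

# DuckDB: query the Parquet file directly and print the result
duckdb.sql("SELECT SUM(amount) FROM 'sales.parquet' WHERE date > '2026-01-01'").show()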
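The WHERE date > '2026-01-01' filter in the DuckDB query above can benefit from predicate pushdown. The following is a simplified sketch of the idea in plain Python, not DuckDB's or Parquet's real code: the `row_groups` structure and the `sum_amount_after` helper are hypothetical stand-ins for a file's row groups and their min/max statistics.

```python
from datetime import date

# Hypothetical row groups: each keeps min/max stats for the "date" column
# alongside its (date, amount) rows.
row_groups = [
    {"min": date(2025, 10, 1), "max": date(2025, 12, 31),
     "rows": [(date(2025, 10, 5), 100.0), (date(2025, 12, 31), 50.0)]},
    {"min": date(2025, 12, 15), "max": date(2026, 1, 20),
     "rows": [(date(2025, 12, 15), 30.0), (date(2026, 1, 20), 70.0)]},
    {"min": date(2026, 2, 1), "max": date(2026, 3, 1),
     "rows": [(date(2026, 2, 1), 10.0), (date(2026, 3, 1), 20.0)]},
]


def sum_amount_after(groups, cutoff):
    """Sum amounts where date > cutoff, skipping groups that cannot match."""
    total = 0.0
    scanned = 0
    for group in groups:
        if group["max"] <= cutoff:
            continue  # no row in this group can satisfy date > cutoff
        scanned += 1
        total += sum(amount for d, amount in group["rows"] if d > cutoff)
    return total, scanned


total, scanned = sum_amount_after(row_groups, date(2026, 1, 1))
print(total, scanned)  # 100.0 2
```

Only two of the three row groups are read: the first is ruled out by its statistics alone. The second still needs a row-level check because its date range straddles the cutoff.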

Related Terms

Understanding Columnar Storage in Apache Parquet

Practical Examples of Using Apache Parquet

Integration of Apache Parquet with Data Processing Frameworks

Frequently Asked Questions

Parquet vs CSV?

CSV: human-readable, row-oriented, no schema, no built-in compression. Parquet: binary, columnar, schema embedded, often 5-20x smaller and more for highly repetitive data. For analytics workloads, prefer Parquet.
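The "no schema" and "row-oriented" points can be seen with the standard library alone. This is a minimal sketch with a hypothetical two-row sales file inlined as a string; the `columns` dictionary is only an in-memory picture of a columnar layout, not the Parquet format itself.

```python
import csv
import io

# Hypothetical sales data, inlined so the sketch is self-contained
raw = "date,amount\n2026-01-05,19.5\n2026-01-06,5.25\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# No schema: every CSV value comes back as a string
print(type(rows[0]["amount"]).__name__)  # str

# Row-oriented: totalling one column still means parsing every full row
total = sum(float(r["amount"]) for r in rows)
print(total)  # 24.75

# Columnar picture of the same data: one typed list per column,
# so a query on "amount" never has to touch "date"
columns = {
    "date": [r["date"] for r in rows],
    "amount": [float(r["amount"]) for r in rows],
}
print(columns["amount"])  # [19.5, 5.25]
```

With CSV the reader has to guess or be told that `amount` is numeric. A Parquet file records the column types in its own metadata, so that casting step disappears.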

Parquet vs ORC?

Both are columnar formats with similar goals. ORC comes from the Hive ecosystem and is known for its built-in indexing. Parquet has broader adoption and is Spark's default format. As of 2026, Parquet is the more common default choice.

Typical size?

CSV 1 GB → Parquet 50-200 MB with Snappy compression (5-20x smaller). ZSTD usually shrinks files further, at a higher CPU cost; the actual gain depends on the data.
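The 5-20x figure follows directly from the sizes quoted. A quick check, assuming 1 GB = 1000 MB for round numbers:

```python
csv_mb = 1000  # 1 GB CSV, taking 1 GB = 1000 MB

# Ratio of CSV size to each end of the quoted Parquet size range
ratios = {parquet_mb: csv_mb / parquet_mb for parquet_mb in (50, 200)}

for parquet_mb, ratio in ratios.items():
    print(f"{csv_mb} MB -> {parquet_mb} MB = {ratio:.0f}x smaller")
# 1000 MB -> 50 MB = 20x smaller
# 1000 MB -> 200 MB = 5x smaller
```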
