
Apache Parquet

Key idea:

Apache Parquet is a columnar storage format developed by Twitter and Cloudera (2013), an ASF top-level project since 2015, and the default format for analytics on a data lake. Benefits: files typically 5-20x smaller than the equivalent CSV, column pruning (read only the columns you need), predicate pushdown (filters applied by the reader). Write from Spark, pandas, or DuckDB; read anywhere.

Below: details, example, FAQ.


Details

  • Columnar: data stored by columns, not rows
  • Compression: Snappy, Gzip, ZSTD, LZ4 per column
  • Encoding: dictionary, RLE, bit-packing, delta for efficient storage
  • Predicate pushdown: a filter like WHERE date > '2026-01-01' is applied by the reader, which uses stored min/max statistics to skip whole row groups instead of scanning everything (see the sketch after this list)
  • Schema embedded in file → self-describing
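
A minimal pyarrow sketch of the last two points, assuming the sales.parquet file from the example below: the schema travels inside the file, and each row group carries per-column min/max statistics that the reader consults before deciding whether it can be skipped.

# Python (pyarrow): inspect the embedded schema and row-group statistics
import pyarrow.parquet as pq

pf = pq.ParquetFile('sales.parquet')        # assumed local file
print(pf.schema_arrow)                      # schema is embedded, no sidecar needed
meta = pf.metadata
print(meta.num_row_groups, meta.num_rows)

# Per-column min/max per row group: what predicate pushdown checks
# in order to skip row groups without reading them
stats = meta.row_group(0).column(0).statistics
print(stats.min, stats.max)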

Example

# Python (pandas)
import pandas as pd
df = pd.read_csv('sales.csv')
df.to_parquet('sales.parquet', compression='snappy')

# Read only needed columns (column pruning)
df = pd.read_parquet('sales.parquet', columns=['date', 'amount'])

# DuckDB: query Parquet directly (pruning and pushdown handled by the reader)
import duckdb
duckdb.sql("SELECT SUM(amount) FROM 'sales.parquet' WHERE date > '2026-01-01'").show()
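
Column pruning and predicate pushdown can also be driven from Python via pyarrow; a rough sketch, assuming date was written as a plain string column (as pd.read_csv would produce):

# Python (pyarrow): push the column list and the filter down to the reader
import pyarrow.parquet as pq

table = pq.read_table(
    'sales.parquet',
    columns=['date', 'amount'],               # column pruning
    filters=[('date', '>', '2026-01-01')],    # predicate pushdown (string compare)
)
df = table.to_pandas()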

Frequently Asked Questions

Parquet vs CSV?

CSV: human-readable, row-oriented, no schema, no compression. Parquet: binary, columnar, embedded schema, typically 5-20x smaller. For analytics, use Parquet.
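
To check the ratio on your own data, a quick sketch (file names are placeholders):

# Python: measure the same data as CSV vs Parquet on disk
import os
import pandas as pd

df = pd.read_csv('sales.csv')
df.to_parquet('sales.parquet', compression='snappy')

csv_mb = os.path.getsize('sales.csv') / 1e6
pq_mb = os.path.getsize('sales.parquet') / 1e6
print(f'CSV {csv_mb:.1f} MB vs Parquet {pq_mb:.1f} MB ({csv_mb / pq_mb:.1f}x)')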

Parquet vs ORC?

Similar columnar formats. ORC: roots in the Hive ecosystem, richer built-in indexes. Parquet: broader ecosystem adoption (Spark's default). As of 2026, Parquet is the safer default choice.

Typical size?

CSV 1 GB → Parquet 50-200 MB with Snappy compression (5-20x). ZSTD compresses further still, at higher CPU cost.
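
A quick way to compare codecs on your own data (assumes the pyarrow engine, which supports both snappy and zstd; file names are placeholders):

# Python (pandas): same DataFrame, two codecs, compare file sizes
import os
import pandas as pd

df = pd.read_csv('sales.csv')
df.to_parquet('sales_snappy.parquet', compression='snappy')
df.to_parquet('sales_zstd.parquet', compression='zstd')

for path in ('sales_snappy.parquet', 'sales_zstd.parquet'):
    print(path, round(os.path.getsize(path) / 1e6, 1), 'MB')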