Apache Parquet — columnar storage format created by Twitter and Cloudera (open-sourced in 2013), now an ASF top-level project. The de facto default format for analytics on a data lake. Benefits: files typically 5-20x smaller than the equivalent CSV (more with aggressive codecs), column pruning (read only the columns you need), and predicate pushdown (filters applied inside the reader, so non-matching row groups are skipped). Write from Spark/Pandas/DuckDB, read anywhere.
Below: details, example, related terms, FAQ.
# Python (pandas)
import pandas as pd
df = pd.read_csv('sales.csv')
df.to_parquet('sales.parquet', compression='snappy')
# Read only needed columns (column pruning)
df = pd.read_parquet('sales.parquet', columns=['date', 'amount'])
# DuckDB — query a Parquet file directly, no load step
import duckdb
duckdb.sql("SELECT SUM(amount) FROM 'sales.parquet' WHERE date > '2026-01-01'")

CSV: human-readable, row-oriented, no schema, no compression. Parquet: binary, columnar, schema embedded, typically 5-20x smaller. For analytics — always Parquet.
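A minimal sketch of column pruning and predicate pushdown together via PyArrow (the file and column names are illustrative, not from the original):

import pandas as pd
import pyarrow.parquet as pq

# Build a small sample file to query.
df = pd.DataFrame({
    'date': pd.to_datetime(['2026-01-01', '2026-02-01', '2025-12-01']),
    'amount': [100.0, 250.0, 75.0],
    'region': ['EU', 'US', 'EU'],
})
df.to_parquet('sales.parquet')

# columns= prunes unread columns; filters= is evaluated against
# row-group statistics before pages are decoded.
table = pq.read_table(
    'sales.parquet',
    columns=['date', 'amount'],     # column pruning
    filters=[('amount', '>', 90)],  # predicate pushdown
)

Here `table` contains only the two requested columns and only the rows with amount above 90; on a large multi-row-group file, entire row groups whose min/max statistics fail the filter are never read from disk.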
Similar columnar formats. ORC: rooted in the Hive ecosystem, with stronger built-in indexing. Parquet: broader adoption — the default in Spark, Pandas/Arrow, DuckDB, and most cloud query engines. As of 2026, Parquet is the safe default.
A 1 GB CSV typically becomes 50-200 MB as Parquet with Snappy compression (5-20x). ZSTD compresses a further 2-3x better, at the cost of more CPU during writes.
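The codec trade-off above can be measured directly; a sketch on synthetic data (column names and sizes are illustrative — actual ratios depend heavily on the data):

import os
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'date': pd.date_range('2026-01-01', periods=100_000, freq='min'),
    'amount': rng.normal(100, 25, 100_000).round(2),
    'region': rng.choice(['EU', 'US', 'APAC'], 100_000),
})

df.to_csv('sales.csv', index=False)
for codec in ['snappy', 'zstd']:
    df.to_parquet(f'sales_{codec}.parquet', compression=codec)

# Compare on-disk sizes: both Parquet files should be far smaller
# than the CSV, with ZSTD usually tighter than Snappy.
for path in ['sales.csv', 'sales_snappy.parquet', 'sales_zstd.parquet']:
    print(path, os.path.getsize(path))

Rule of thumb: Snappy for hot data rewritten often (cheap CPU), ZSTD for cold data where storage dominates.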