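Data Lake: a store for raw data in any format (JSON, CSV, Parquet, Avro) on object storage or a distributed file system (S3, GCS, HDFS). Schema-on-read: structure is applied when the data is queried. Data Warehouse: structured data optimized for analytical queries (Snowflake, BigQuery, Redshift). Schema-on-write: data must match the table schema when it is loaded. Lakehouse (for example Databricks, or Apache Iceberg tables queried through an engine such as Trino): a hybrid that combines raw storage with a warehouse-style query engine.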
Below: details, an example, related terms, and an FAQ.
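The difference between schema-on-write and schema-on-read can be shown with a toy model. This is a minimal sketch, not any vendor's API: `ORDER_SCHEMA`, `Warehouse`, and `Lake` are hypothetical names invented for illustration. The warehouse rejects a bad record at load time, while the lake accepts anything and filters only when the data is read.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
ORDER_SCHEMA = {"date": str, "country": str, "revenue": float}


def conforms(record: dict) -> bool:
    """True if the record has exactly the schema's fields with the expected types."""
    return record.keys() == ORDER_SCHEMA.keys() and all(
        isinstance(record[name], expected) for name, expected in ORDER_SCHEMA.items()
    )


class Warehouse:
    """Schema-on-write: a record is validated before it is stored."""

    def __init__(self):
        self.rows = []

    def write(self, record: dict) -> None:
        if not conforms(record):
            raise ValueError(f"rejected at write time: {record!r}")
        self.rows.append(record)


class Lake:
    """Schema-on-read: raw payloads are stored as-is and parsed at query time."""

    def __init__(self):
        self.objects = []

    def write(self, raw: str) -> None:
        self.objects.append(raw)  # no validation on the way in

    def read(self):
        for raw in self.objects:
            try:
                record = json.loads(raw)
            except json.JSONDecodeError:
                continue  # unparseable payloads are skipped at read time
            if isinstance(record, dict) and conforms(record):
                yield record
```

In this sketch a malformed payload costs nothing to store in the lake, but every reader has to cope with it. The warehouse pays the validation cost once, at load time.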
-- Snowflake (Data Warehouse) — structured SQL
SELECT date, country, SUM(revenue)
FROM orders
WHERE date >= '2026-01-01'
GROUP BY 1, 2;
-- Data Lake + Iceberg + Trino (Lakehouse) — same SQL
-- over Parquet files in S3
SELECT date, country, SUM(revenue)
FROM iceberg.prod.orders
WHERE date >= DATE '2026-01-01'
GROUP BY 1, 2;
Lake: raw or semi-structured data, petabyte scale, ML training data. Warehouse: structured data, BI dashboards, fast queries. Lakehouse: a general-purpose compromise between the two.
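S3 storage for a lake costs about $0.02 per GB per month. Snowflake warehouse compute costs $2-4 per credit. For 10 TB, that works out to roughly $200 per month for lake storage versus roughly $4k per month for warehouse storage plus compute.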
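The arithmetic behind these figures can be sketched as follows. The storage price and the per-credit range come from the text. The monthly credit count is not given there, so `ASSUMED_CREDITS_PER_MONTH` is a hypothetical input chosen only to show how the compute bill scales. Real consumption depends on warehouse size and hours of use.

```python
GB_PER_TB = 1000  # decimal units, which is what yields the ~$200 figure for 10 TB


def lake_storage_cost(tb: float, usd_per_gb_month: float = 0.02) -> float:
    """Monthly object-storage cost for `tb` terabytes."""
    return tb * GB_PER_TB * usd_per_gb_month


def warehouse_compute_cost(credits: float, usd_per_credit: float) -> float:
    """Monthly compute cost for a given number of consumed credits."""
    return credits * usd_per_credit


# Hypothetical workload, not taken from the text.
ASSUMED_CREDITS_PER_MONTH = 1000

lake_usd = lake_storage_cost(10)
compute_low_usd = warehouse_compute_cost(ASSUMED_CREDITS_PER_MONTH, 2.0)
compute_high_usd = warehouse_compute_cost(ASSUMED_CREDITS_PER_MONTH, 4.0)

print(f"Lake storage, 10 TB: ${lake_usd:,.0f}/month")
print(f"Warehouse compute:   ${compute_low_usd:,.0f} to ${compute_high_usd:,.0f}/month")
```

Under that assumed workload, compute dominates the warehouse bill, which is why the gap between the two totals is so much larger than the storage prices alone would suggest.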
A lake without governance turns into a swamp. Remedies: a data catalog (AWS Glue, Unity Catalog) and Iceberg for ACID transactions and schema evolution.
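To give a feel for what atomic commits add on top of plain files, here is a toy model of snapshot-based commits with an optimistic concurrency check. It is a simplified illustration, not the Iceberg API: `ToyTable`, `CommitConflict`, and the file names are hypothetical. A writer publishes a new snapshot only if no one else has committed since the snapshot it started from.

```python
class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyTable:
    """Toy model of snapshot-based table commits (not the Iceberg API)."""

    def __init__(self):
        self.snapshots = [frozenset()]  # snapshot 0: an empty table

    @property
    def current_id(self) -> int:
        return len(self.snapshots) - 1

    def files(self) -> frozenset:
        """Data files visible to readers of the current snapshot."""
        return self.snapshots[-1]

    def commit(self, base_id: int, added_files: set) -> int:
        # Compare-and-swap: publish only if nobody committed since base_id.
        if base_id != self.current_id:
            raise CommitConflict(
                f"base snapshot {base_id} is stale, current is {self.current_id}"
            )
        self.snapshots.append(self.snapshots[-1] | frozenset(added_files))
        return self.current_id
```

Readers in this model always see a complete snapshot. A writer that loses the race gets an error and can retry against the new current snapshot, so no one observes a half-written set of files.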