What it is
A data lakehouse is a data-platform architecture that adds the reliability, governance and performance features of a data warehouse on top of the cheap, flexible storage of a data lake. Instead of copying data between a lake for raw exploration and a separate warehouse for SQL reporting, a data lakehouse keeps a single copy of your data in low-cost object storage and makes it behave like a warehouse when you query it. As Chapter 1 of the AD7 Data Science study pack puts it, the lakehouse is the "unified modern platform" that supports both raw and structured data, and both schema-on-read and schema-on-write.
The breakthrough that makes this possible is the open table format. Underneath, data is stored as columnar Apache Parquet files sitting in object storage such as Amazon S3, Azure Data Lake Storage or Google Cloud Storage. On its own, a pile of Parquet files is just a data lake: no transactions, no guarantees. A table format layers a metadata and transaction log over those files so the engine sees a real, versioned table rather than loose objects. The three dominant open formats are Delta Lake (originally from Databricks), Apache Iceberg (originating at Netflix) and Apache Hudi (originating at Uber). All three are open source, and all three turn object storage into tables you can safely UPDATE, DELETE, MERGE and time-travel through.
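To make the idea of a log layered over loose files concrete, here is a minimal sketch in plain Python. It is a toy model, not the on-disk layout of Delta Lake, Iceberg or Hudi: the `Commit` class, the `files_at` helper and the file names are all hypothetical. It only shows how replaying a list of commits yields the set of files that make up a given table version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """One atomic entry in the toy transaction log (hypothetical structure)."""
    version: int
    added: tuple = ()    # data files that join the table in this commit
    removed: tuple = ()  # data files that leave the table in this commit


def files_at(log, version=None):
    """Replay the log up to `version` (latest if None) and return the live files."""
    live = set()
    for commit in sorted(log, key=lambda c: c.version):
        if version is not None and commit.version > version:
            break
        live -= set(commit.removed)
        live |= set(commit.added)
    return live


# Hypothetical history: two appends, then an UPDATE that rewrites one file.
log = [
    Commit(0, added=("part-000.parquet",)),
    Commit(1, added=("part-001.parquet",)),
    Commit(2, added=("part-002.parquet",), removed=("part-000.parquet",)),
]

current_files = files_at(log)        # latest snapshot of the table
files_v1 = files_at(log, version=1)  # "time travel" back to version 1

print(sorted(current_files))
print(sorted(files_v1))
```

The Parquet files themselves are never edited in place in this model. An update adds a new file and marks the old one as removed, which is why an earlier version can still be reconstructed from the log.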
How it works in practice
The mechanism rests on four building blocks working together.
1. Open file format. Raw data lands as Parquet (occasionally ORC or Avro) in object storage. This is the same cheap, scalable, vendor-neutral layer a plain data lake uses — you are not locked into a proprietary warehouse store.
2. Metadata and transaction layer. The table format maintains a transaction log (Delta Lake's _delta_log, Iceberg's manifest and snapshot files, Hudi's timeline). This log records every change as an atomic commit and tracks which files belong to the current version of the table. It is what lets multiple readers and writers operate on the same data without corrupting it.
3. ACID transactions on object storage. This is the headline capability. In Chapter 1, ACID (Atomicity, Consistency, Isolation, Durability) is presented as the guarantee that classically separated relational databases from lakes. The lakehouse brings those same guarantees to the lake: a write either commits fully or not at all (atomicity), every commit moves the table from one valid state to another (consistency), concurrent jobs do not see each other's half-finished state (isolation), and a committed change survives a crash (durability). That is why you can run a reliable ELT MERGE or a GDPR "right to be forgotten" delete against a lakehouse table, something a bare data lake cannot do safely.
4. Schema enforcement and evolution. The metadata layer validates incoming data against the table schema (schema-on-write style enforcement) while still allowing controlled schema evolution, such as adding or renaming columns without rewriting existing data files. This prevents the silent breakage Chapter 1 warns about, where one field change quietly breaks every downstream pipeline.
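Building blocks 3 and 4 can be sketched together in a few lines of Python. This is an illustrative in-memory model under stated assumptions, not how any of the three formats is implemented: `ToyTable`, `CommitConflict` and `SchemaMismatch` are hypothetical names, and the version check stands in for whatever atomic operation a real format relies on in the object store or catalog. It shows a batch being rejected as a whole, a stale writer being refused, and a column being added without touching old rows.

```python
class CommitConflict(Exception):
    """Raised when another writer has already claimed the next version."""


class SchemaMismatch(Exception):
    """Raised when incoming rows do not match the table schema."""


class ToyTable:
    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> Python type
        self.log = {}               # version -> rows committed at that version

    @property
    def version(self):
        return max(self.log, default=-1)

    def _validate(self, rows):
        for row in rows:
            if set(row) != set(self.schema):
                raise SchemaMismatch(f"columns {sorted(row)} != {sorted(self.schema)}")
            for col, typ in self.schema.items():
                if row[col] is not None and not isinstance(row[col], typ):
                    raise SchemaMismatch(f"{col!r} expects {typ.__name__}")

    def commit(self, rows, read_version):
        """Append rows atomically; fail if the table moved since read_version."""
        rows = [dict(r) for r in rows]
        self._validate(rows)              # every row is checked before anything is written
        if read_version != self.version:  # someone else committed first
            raise CommitConflict(f"expected v{read_version}, table is at v{self.version}")
        self.log[read_version + 1] = rows  # one step publishes the whole batch

    def add_column(self, name, typ):
        """Schema evolution: existing rows read the new column as None."""
        self.schema[name] = typ

    def read(self):
        return [
            {col: row.get(col) for col in self.schema}
            for v in sorted(self.log)
            for row in self.log[v]
        ]


table = ToyTable({"id": int, "amount": float})
table.commit([{"id": 1, "amount": 9.5}], read_version=-1)

rejected_bad_batch = False
try:
    # The second row has the wrong type, so the whole batch is refused.
    table.commit([{"id": 2, "amount": 1.0}, {"id": "three", "amount": 2.0}], read_version=0)
except SchemaMismatch:
    rejected_bad_batch = True

rejected_stale_writer = False
try:
    # This writer still believes the table is empty.
    table.commit([{"id": 4, "amount": 4.0}], read_version=-1)
except CommitConflict:
    rejected_stale_writer = True

table.add_column("currency", str)
table.commit([{"id": 5, "amount": 7.0, "currency": "EUR"}], read_version=0)
rows = table.read()
print(rows)
```

Note that the valid row with `id` 2 never appears in the table: because it travelled in the same batch as an invalid row, atomicity means neither is committed.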
On top of these sit the familiar warehouse niceties: time travel (querying an earlier snapshot for audit or reproducibility), compaction of small files, partitioning, and fine-grained access control. Crucially, the lakehouse inherits the cloud-warehouse pattern of separating storage from compute — many engines (Spark, Trino, Flink, Snowflake, DuckDB) can read the same Iceberg or Delta tables, with workload isolation per team to avoid the "noisy neighbour" problem.
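Compaction of small files, mentioned above, can be illustrated with a simple greedy grouping. This is a sketch under assumptions, not the algorithm used by any particular table format or engine: the function names, file names, sizes and the 128-unit target are all hypothetical. Files below the target are grouped into bins, and each bin is swapped for a single compacted file in the table's live file set.

```python
def plan_compaction(file_sizes, target_size):
    """Group small files into bins whose combined size stays within target_size."""
    bins, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        if size >= target_size:
            continue  # already large enough, leave it alone
        if current and current_size + size > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bins.append(current)
    # Rewriting a bin that holds a single file gains nothing, so drop those.
    return [group for group in bins if len(group) > 1]


def apply_compaction(live_files, bins):
    """Return the new live set: each bin is replaced by one compacted file."""
    new_live = set(live_files)
    for i, group in enumerate(bins):
        new_live -= set(group)
        new_live.add(f"compact-{i}.parquet")
    return new_live


# Hypothetical file sizes in megabytes.
file_sizes = {
    "a.parquet": 10,
    "b.parquet": 20,
    "c.parquet": 200,
    "d.parquet": 30,
    "e.parquet": 90,
}

bins = plan_compaction(file_sizes, target_size=128)
new_live = apply_compaction(file_sizes.keys(), bins)

print(bins)
print(sorted(new_live))
```

In a real lakehouse the swap would be recorded as one commit in the transaction log, so readers see either the old small files or the new compacted one, never a mixture.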
| Dimension | Data Lake | Data Warehouse | Data Lakehouse |
|---|---|---|---|
| Storage | Object storage, raw files | Proprietary optimised store | Object storage + open table format |
| Schema | Schema-on-read | Schema-on-write | Both (enforced, evolvable) |
| Data types | Structured + unstructured | Structured only | Structured + unstructured |
| Transactions / ACID | None (no guarantees) | Full ACID | ACID via the table format |
| Cost | Lowest (cheap storage) | Highest (storage + engine) | Low (cheap storage, pay-per-use compute) |
| Typical use | Exploration, ML, big data | BI dashboards, SQL reporting | Unified BI + ML on one copy |
| Governance | Weak (risk of "data swamp") | Strong, mature | Strong (enforced schema, lineage, ACL) |
The EU angle is concrete. Open formats and open storage align directly with the European Data Strategy and the push for digital sovereignty — there is no proprietary lock-in to a single vendor's storage engine. An EU institution can publish and process the same datasets that feed data.europa.eu without duplicating them across systems, and can attach DCAT-AP metadata for discoverability. The columnar, engine-agnostic nature of a lakehouse also suits EuroHPC-style large-scale analytics, where many compute jobs need to read one shared, governed dataset reliably. For pipelines handling Eurostat statistics or ABAC financial records, the ACID guarantee is not a luxury — it is what makes the data trustworthy and auditable.
Common points of confusion
- A lakehouse is not "a data lake with a warehouse bolted on top." The whole point is that there is a single storage layer and single copy of the data. If you are still loading the same data from a lake into a separate warehouse for BI, you have two systems and two copies — that is the old two-tier architecture the lakehouse was designed to replace, not a lakehouse.
- A table format is not a file format. Parquet, ORC and Avro are file formats — they describe how one file's bytes are laid out. Delta Lake, Iceberg and Hudi are table formats — they describe how a collection of files behaves as a transactional, versioned table. Iceberg tables are usually made of Parquet files; the two layers are complementary, not alternatives.
- "Lakehouse" is an architecture, not a product. Databricks popularised the term, but a lakehouse is a pattern you can build with open-source components on any cloud. Delta Lake, Iceberg and Hudi are the formats; Spark, Trino, Flink and others are the engines. Do not equate the concept with one vendor's branded platform.
Why it matters for EU data scientists
For a working data practitioner, the lakehouse is now the default modern target architecture, so being able to reason about why you would choose it — one governed copy of data serving both ML feature pipelines and SQL dashboards, at object-storage cost — is a core engineering skill. You will meet it whenever you design ingestion, choose between ETL and ELT, or plan how analysts and data scientists share the same source of truth.
For the EPSO/AD/429/26 Field 4 (Data Science) competition, it sits squarely in Duty Area 1, Data Engineering. The exam tests principles and trade-offs, not syntax: expect scenario questions that hinge on the lake versus warehouse versus lakehouse distinction ("store raw sensor data and run ML on it" points to a lake; "fast SQL for monthly budget reports" points to a warehouse; "one platform that does both" points to the lakehouse), and on the fact that ACID on object storage is exactly what a lakehouse adds over a plain lake. Knowing that a table format, not a file format, delivers those transactions is the kind of easily confused distinction EPSO loves to probe. Build that fluency with the full study pack, Prep for AD7 Data Science on Prep4EU.