Apache XTable. Delta vs Iceberg vs Hudi.
This blog post reviews an Apache Incubating project called Apache XTable, which aims to provide cross-format interoperability among Delta Lake, Apache Hudi, and Apache Iceberg. Below is a concise breakdown from some time I spent playing around with this new tool, along with a few technical observations:
1. What is Apache XTable?
- Not a New Format: It’s explicitly not another “table format.” Instead, it translates existing Lakehouse table metadata so that one physical dataset can be recognized as Delta, Hudi, or Iceberg.
- “Omni-directional Interop”: The goal is to let you read/write the same physical data files from any of the three major table formats without duplicating data.
2. How It Works at a High Level
- Reads your existing metadata (e.g., the Delta log or the Hudi `.hoodie` folder).
- Generates metadata for the other table formats (e.g., it creates a `metadata/` folder for Iceberg or a `.hoodie/` folder for Hudi).
- No Physical Copy: the actual data files are never duplicated; only the metadata gets translated or rewritten.
- Query using the familiar `spark.read.format("delta|hudi|iceberg").load("...")` syntax. In theory, you can point to the same data location and pick a format, as sketched below.
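To make that intent concrete, here is a minimal PySpark sketch. The table path is hypothetical, and it assumes a Spark session with the Delta, Hudi, and Iceberg runtime JARs on the classpath; as the notes further down show, this did not work everywhere for me.

```python
from pyspark.sql import SparkSession

# Assumes the Delta, Hudi, and Iceberg Spark runtime JARs are on the classpath.
spark = SparkSession.builder.appName("xtable-read-demo").getOrCreate()

# Hypothetical location: one physical set of Parquet files with
# _delta_log/, .hoodie/, and metadata/ folders sitting next to each other.
base_path = "s3://my-bucket/tables/orders"

df_delta = spark.read.format("delta").load(base_path)      # uses _delta_log/
df_hudi = spark.read.format("hudi").load(base_path)        # uses .hoodie/
df_iceberg = spark.read.format("iceberg").load(base_path)  # uses metadata/
```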
3. Setup and Configuration
- Requires Java 11: The project builds with Maven and carries various version constraints (for example, around Spotless, a code-formatting plugin).
- Configuration via YAML: a small dataset config file tells XTable the source format, the target formats, and the table path(s); see the sketch after this list.
- Command-Line Invocation: the sync runs via the bundled utilities JAR produced by the Maven build, also shown below.
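A minimal sketch of the dataset YAML looks roughly like this; the field names follow the project's sample config as I remember it and may differ between versions, and the bucket and table names are made up:

```yaml
# my_config.yaml - hypothetical dataset config for the XTable sync utility
sourceFormat: DELTA        # format the table currently uses
targetFormats:
  - HUDI
  - ICEBERG                # formats whose metadata XTable should generate
datasets:
  - tableBasePath: s3://my-bucket/tables/orders   # hypothetical table location
    tableName: orders
```

The conversion itself is invoked through the bundled utilities JAR; something along these lines worked for me, though the exact JAR name and path depend on the version you build:

```bash
# Run the XTable sync against the YAML config above
java -jar xtable-utilities/target/xtable-utilities-*-bundled.jar \
  --datasetConfig my_config.yaml
```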
4. Building from Source (Pain Points)
- Maven & Java Versions: The blog highlights issues with Java 17 or later vs. Java 11, causing Spotless or plugin mismatches.
- Dockerfile: The provided Dockerfile in the project apparently had syntax or versioning issues. I had to create a custom Dockerfile with Java 11 + Maven to build the project successfully.
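For reference, here is a minimal sketch of the kind of build image I ended up with; the base image tag is one plausible choice for Java 11 + Maven, not the one the project ships:

```dockerfile
# Hypothetical build container: Java 11 + Maven
FROM maven:3.9-eclipse-temurin-11

WORKDIR /build
COPY . .

# Skip tests to keep the build quick; the bundled utilities JAR lands under */target/
RUN mvn -B clean package -DskipTests
```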
5. Trying It Out
- I tested XTable with:
- A Delta Lake table in S3 (both a Unity Catalog–managed table and a regular “unmanaged” table).
- Ran the XTable conversion to produce Hudi (`.hoodie/`) and Iceberg (`metadata/`) folders.
- Reading the Converted Tables:
- The Databricks Spark environment often complained about “overlapping” or “incorrect path” errors when trying to read the newly created Iceberg/Hudi metadata from the same base directory.
- I had better luck with Polars locally, though I had to point directly to the Iceberg metadata JSON file (e.g., `v2.metadata.json`) rather than just the parent directory; see the snippet below.
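Here is roughly what the Polars read looked like. It is a sketch with hypothetical local paths, and it assumes the optional deltalake and pyiceberg dependencies that Polars needs for these readers are installed:

```python
import polars as pl

# Hypothetical local copy of the converted table
table_path = "/data/orders"

# Delta works against the table root (reads _delta_log/)
df_delta = pl.scan_delta(table_path).collect()

# For Iceberg, pointing at the table root failed for me; pointing directly
# at the metadata JSON that XTable wrote did work.
df_iceberg = pl.scan_iceberg(f"{table_path}/metadata/v2.metadata.json").collect()
```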
6. Current Observations
- Early-Stage or “Incubating”: This is an incubating Apache project, so friction and bugs are expected:
- Building can be finicky.
- Reading with Spark or other engines may require extra steps or fail silently.
- Docs are still sparse, especially around whether (and how) incremental sync is supposed to work in production.
- Daily/Continuous Sync?: XTable offers “incremental” and “full” sync, implying you might need to re-run these conversions regularly to keep each format’s metadata up to date.
- I find the concept worthwhile—XTable could solve real problems for teams who need to maintain multiple table formats without duplicating large data sets.
- However, the execution is not yet smooth:
- Confusing build process.
- Docs are missing or incomplete on how to read the newly created metadata in Spark or other engines.
- Possibly incompatible or untested with certain Databricks or Unity Catalog configurations out of the box.
Apache XTable is a promising idea: keep the data in one physical location and read it as Delta, Hudi, or Iceberg. However, it is still rough around the edges: building requires exact tooling, reading the converted metadata can fail depending on how you do it, and the documentation leaves gaps. If you’re interested in cross-format interoperability and you’re comfortable with early-stage, Java-based open-source tools, it’s worth experimenting with, but it’s probably not production-ready in its current form.