
Apache XTable. Delta vs Iceberg vs Hudi.

This blog post reviews Apache XTable, an Apache Incubating project that aims to provide cross-format interoperability among Delta Lake, Apache Hudi, and Apache Iceberg. Below is a concise breakdown from some time I spent playing around with this new tool, along with some technical observations:

1. What is Apache XTable?

  • Not a New Format: It’s explicitly not another “table format.” Instead, it translates existing Lakehouse table metadata so that one physical dataset can be recognized as Delta, Hudi, or Iceberg.
  • “Omni-directional Interop”: The goal is to let you read/write the same physical data files from any of the three major table formats without duplicating data.

2. How It Works at a High Level

  1. Reads your existing metadata (e.g., the Delta _delta_log/ or the Hudi .hoodie/ folder).
  2. Generates metadata for the other table formats (e.g., it creates a metadata/ folder for Iceberg or a .hoodie/ folder for Hudi).
  3. No Physical Copy of the actual data files—only the metadata gets duplicated or converted.
  4. Query using the familiar spark.read.format("delta|hudi|iceberg").load("...") syntax. In theory, you can point to the same data location and pick a format.
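
To make step 4 concrete, here is a minimal PySpark sketch of the cross-format read, not a verified recipe: it assumes a Spark session that already has the Delta, Hudi, and Iceberg connectors configured, and it reuses the example path from the config further down. The format strings are the standard ones each connector registers.

```python
# Minimal sketch: read the same physical dataset through each format's connector.
# Assumes XTable has already generated metadata for all three formats at this path
# and that the Delta, Hudi, and Iceberg Spark connectors are on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("xtable-read-check").getOrCreate()

base_path = "s3://path/to/source/data"  # one physical copy of the data files

delta_df = spark.read.format("delta").load(base_path)
hudi_df = spark.read.format("hudi").load(base_path)
iceberg_df = spark.read.format("iceberg").load(base_path)

# If the metadata translation worked, all three views should agree on row count.
for name, df in [("delta", delta_df), ("hudi", hudi_df), ("iceberg", iceberg_df)]:
    print(name, df.count())
```

As section 5 shows, this “just point at the same directory” story is exactly where I hit the most friction in practice.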

3. Setup and Configuration

  • Requires Java 11: The project builds with Maven and has various version constraints (for example, around Spotless, a code-formatting plugin).
  • Configuration via YAML:
    ```yaml
    sourceFormat: DELTA
    targetFormats:
      - HUDI
      - ICEBERG
    datasets:
      - tableBasePath: s3://path/to/source/data
        tableDataPath: s3://where/you/want/the/data
        tableName: mytable
        namespace: my.db
    ```
  • Command-Line Invocation:
    ```bash
    java -jar xtable-utilities/target/xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar \
      --datasetConfig config.yaml \
      [--hadoopConfig hdfs-site.xml] \
      [--convertersConfig converters.yaml] \
      [--icebergCatalogConfig catalog.yaml]
    ```

4. Building from Source (Pain Points)

  • Maven & Java Versions: The blog highlights issues with Java 17 or later vs. Java 11, causing Spotless or plugin mismatches.
  • Dockerfile: The Dockerfile provided in the project had syntax or versioning issues, so I ended up creating a custom Dockerfile with Java 11 + Maven to build the project successfully.

5. Trying It Out

  • My test setup:
    • A Delta Lake table in S3 (both a Unity Catalog–managed table and a regular “unmanaged” table).
    • An XTable conversion run against it to produce Hudi (.hoodie/) and Iceberg (metadata/) folders.
  • Reading the Converted Tables:
    • The Databricks Spark environment often complained about “overlapping” or “incorrect path” errors when trying to read the newly created Iceberg/Hudi metadata from the same base directory.
    • I had better luck with Polars locally, though I had to point it directly at the Iceberg metadata JSON file (e.g., v2.metadata.json) rather than just the parent directory.
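
For reference, the local Polars read that eventually worked looked roughly like this. It is a sketch under a few assumptions: polars installed alongside pyiceberg, S3 credentials available in the environment, and v2.metadata.json being whatever the latest metadata file happened to be after my conversion run.

```python
# Rough sketch of the local Polars read of the XTable-generated Iceberg metadata.
# Assumes `pip install polars pyiceberg` and S3 credentials in the environment.
import polars as pl

# Pointing at the concrete metadata JSON worked; pointing at the parent
# metadata/ directory did not. The version prefix in the file name depends
# on how many commits/conversions the table has seen.
metadata_file = "s3://path/to/source/data/metadata/v2.metadata.json"

df = pl.scan_iceberg(metadata_file).collect()
print(df.shape)
```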

6. Current Observations

  • Early-Stage or “Incubating”: This is an incubating Apache project, so friction and bugs are expected:
    • Building can be finicky.
    • Reading with Spark or other engines may require extra steps or fail silently.
    • Docs are still sparse, especially around whether (and how) incremental sync is supposed to work in production.
  • Daily/Continuous Sync?: XTable offers “incremental” and “full” sync, implying you might need to re-run these conversions regularly to keep each format’s metadata up to date (a rough sketch of what that could look like follows this list).
  • I find the concept worthwhile—XTable could solve real problems for teams who need to maintain multiple table formats without duplicating large data sets.
  • However, the execution is not yet smooth:
    • Confusing build process.
    • Docs on how to read the newly created metadata from Spark or other engines are missing or incomplete.
    • Possibly incompatible or untested with certain Databricks or Unity Catalog configurations out of the box.
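
On the sync question above: I have not run this in production, but keeping each format’s metadata fresh presumably comes down to re-running the same CLI command on a schedule. A hypothetical wrapper might look like the sketch below; the jar path and config file name are taken from the command-line example earlier, and the actual scheduling would live in cron, Airflow, or whatever orchestrator you already use.

```python
# Hypothetical re-sync wrapper: it just shells out to the bundled XTable jar
# shown earlier. Jar path and config name are copied from that example.
import subprocess

def run_xtable_sync(config_path: str = "config.yaml") -> None:
    """Run one XTable metadata sync pass against the configured datasets."""
    subprocess.run(
        [
            "java",
            "-jar",
            "xtable-utilities/target/xtable-utilities_2.12-0.2.0-SNAPSHOT-bundled.jar",
            "--datasetConfig",
            config_path,
        ],
        check=True,
    )

if __name__ == "__main__":
    run_xtable_sync()  # e.g., trigger from cron/Airflow after each batch of writes
```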

Apache XTable is a promising idea: keep your data in one physical location but read it as Delta, Hudi, or Iceberg. However, it is still rough around the edges: building requires exact tooling, reading the converted metadata can fail depending on how you do it, and the documentation leaves gaps. If you’re interested in cross-format interoperability and you’re comfortable with early-stage, Java-based open-source tools, it’s worth experimenting with, but it’s probably not production-ready in its current form.