Cloudflare R2 Storage with Apache Iceberg
Rethinking Object Storage: A First Look at Cloudflare R2 and Its Built‑In Apache Iceberg Catalog
Sometimes, we follow tradition because, well, it works—until something new comes along and makes us question the status quo. For many of us, Amazon S3 is that well‑trodden path: the backbone of our data platforms and pipelines, used countless times each day. If it vanished, our entire workflow would skid to a halt—and cost us a fortune in the process.
Enter Cloudflare R2. I’d been meaning to kick the tires on R2 for a while, but Cloudflare recently added a feature that snapped me upright: a fully managed R2 Data Catalog for Apache Iceberg baked right into each bucket. Suddenly, “S3‑compatible storage” is only half the story—now we have an integrated Iceberg catalog with zero egress fees.
Below is a practical tour of R2: what it is, how it compares to S3, and how ridiculously easy it is to spin up an Iceberg table that you can query from Spark, PyIceberg, Snowflake, or Daft on Databricks.
What Is Cloudflare R2?
“Cloudflare R2 is a global object‑storage service that’s S3‑API compatible and charges no egress fees.” — some helpful AI bot
Compatibility with the S3 API is the headline feature. It means your existing code, your team’s muscle memory, and your data‑migration tooling all work out of the box. No funky SDKs, no rewritten wrappers—just switch the endpoint and credentials.
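To see what that swap looks like in practice, here’s a minimal boto3 sketch; the account ID, bucket name, and keys are placeholders you’d pull from your own Cloudflare dashboard:

```python
# A minimal sketch of repointing existing S3 code at R2 with boto3.
# The account ID, bucket name, and keys below are placeholders,
# not real values.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com",
    aws_access_key_id="<R2_ACCESS_KEY_ID>",
    aws_secret_access_key="<R2_SECRET_ACCESS_KEY>",
)

# From here on, it's the S3 API you already know
s3.put_object(Bucket="my-bucket", Key="hello.txt", Body=b"hi from R2")
print(s3.list_objects_v2(Bucket="my-bucket")["KeyCount"])
```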
Pricing (The Straightforward Version)
Cloudflare divides operations into two classes:
| Class | Typical Use | Price Point* |
|---|---|---|
| A | State-changing ops (PUT, DELETE, etc.) | Higher |
| B | Reads (GET, HEAD, LIST) | Lower |
For the Infrequent Access tier, you pay a retrieval fee, but egress bandwidth is free—a sharp contrast to S3’s often‑opaque data‑transfer charges. If you’ve ever tried to decode S3’s pricing matrix, R2 feels refreshingly simple.
*Check Cloudflare’s docs for current rates—pricing can shift.
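To make the egress difference concrete, here’s a quick back-of-envelope sketch; the S3 rate is an illustrative assumption, not a quote:

```python
# Back-of-envelope: read 1 TB/month back out of storage.
# The S3 figure assumes roughly $0.09/GB for the first internet-egress
# tier; treat it as illustrative and check current pricing pages.
tb_read = 1
s3_egress_per_gb = 0.09  # USD, illustrative assumption
print(f"S3 egress: ${tb_read * 1024 * s3_egress_per_gb:,.2f}/month")  # ~$92
print("R2 egress: $0.00/month")  # R2 charges no egress bandwidth fees
```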
Enabling the R2 Data Catalog (Iceberg)
- Install Wrangler (Cloudflare’s CLI).
```bash
brew install node          # provides npm/npx
npm install -g wrangler    # or rely on `npx wrangler` per command
```
- Create or choose an R2 bucket.
- Turn on the catalog for that bucket:
```bash
npx wrangler r2 bucket catalog enable my-iceberg-bucket
```
You’ll get a response containing:
- Catalog URI – the REST endpoint you’ll point Iceberg clients at.
- Warehouse name – your logical namespace.
Create an API token with the “R2 Data Catalog” permission, and you’re done. The whole process takes minutes—and it actually worked first try.
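If you want to confirm everything is wired up before pointing a full engine at it, a quick sanity check of my own devising (not an official Cloudflare recipe) works: the Iceberg REST spec exposes a GET /v1/config endpoint, so a 200 response means the catalog is reachable and your token is accepted.

```python
# Quick smoke test against the Iceberg REST spec's /v1/config endpoint.
# Substitute the catalog URI and API token you got from the steps above.
import urllib.request

catalog_uri = "https://catalog.cloudflarestorage.com/<account>/<bucket>"
req = urllib.request.Request(
    f"{catalog_uri}/v1/config",
    headers={"Authorization": "Bearer <API_TOKEN>"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read()[:200])
```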
Creating an Iceberg Table on R2
With the catalog enabled, any Iceberg‑aware engine can create tables. Below is an abbreviated Databricks + Daft example that uses the Backblaze hard‑drive dataset:
```python
import os

import daft
from pyiceberg.catalog import load_catalog

catalog_uri = "https://catalog.cloudflarestorage.com/<account>/<bucket>"
warehouse = "<account>_<bucket>"
token = "<API_TOKEN>"

# Configure PyIceberg's "r2" catalog via environment variables
# (PyIceberg nests config keys with double underscores)
os.environ["PYICEBERG_CATALOG__R2__TYPE"] = "rest"
os.environ["PYICEBERG_CATALOG__R2__URI"] = catalog_uri
os.environ["PYICEBERG_CATALOG__R2__TOKEN"] = token
os.environ["PYICEBERG_CATALOG__R2__WAREHOUSE"] = warehouse

catalog = load_catalog("r2")
catalog.create_namespace("my_iceberg")

# Load CSV → write to an Iceberg table (to_arrow() materializes the data
# once so we can hand PyIceberg a pyarrow schema for table creation)
df = daft.read_csv("s3://backblaze-datasets/hard-drives/*.csv")
table = catalog.create_table("my_iceberg.hard_drives", schema=df.to_arrow().schema)
df.write_iceberg(table)
```
Reading the table is just as painless:
```python
iceberg_df = daft.read_iceberg(catalog.load_table("my_iceberg.hard_drives"))
iceberg_df.show()
```
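And because R2’s catalog speaks the standard Iceberg REST protocol, other engines can attach to the same table. Here’s a hedged Spark sketch; it assumes the iceberg-spark-runtime package matching your Spark version is on the classpath and reuses the URI, warehouse, and token from above:

```python
from pyspark.sql import SparkSession

# Register the R2 Data Catalog as an Iceberg REST catalog named "r2".
# Assumes a matching iceberg-spark-runtime jar is available to Spark.
spark = (
    SparkSession.builder.appName("r2-iceberg")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.r2", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.r2.type", "rest")
    .config("spark.sql.catalog.r2.uri",
            "https://catalog.cloudflarestorage.com/<account>/<bucket>")
    .config("spark.sql.catalog.r2.token", "<API_TOKEN>")
    .config("spark.sql.catalog.r2.warehouse", "<account>_<bucket>")
    .getOrCreate()
)

spark.sql("SELECT COUNT(*) FROM r2.my_iceberg.hard_drives").show()
```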
No exotic configuration, no storage‑layer hacks, no unexpected egress bills when Databricks pulls the data—just smooth sailing.
Final Thoughts
Cloudflare’s R2 isn’t “an S3 killer” in the sense that you’ll ditch AWS overnight, but it is a compelling alternative:
- S3‑API compatibility means near‑zero migration friction.
- Zero egress fees simplify cost models—especially for multi‑cloud analytics.
- Integrated Iceberg catalog removes yet another piece of glue code from your stack.
Pair R2 with engines like Spark, PyIceberg, or Daft, and you’ve got a production‑ready lakehouse that you can spin up almost as quickly as you can read this sentence. Competition is good for the ecosystem, and Cloudflare just raised the bar.
Have you tried R2 in your own pipelines? I’d love to hear how it stacks up against S3 in real‑world workloads—drop your thoughts in the comments.