Developing Production-Level Databricks Pipelines
A question that comes up often … “How do I develop production-level Databricks pipelines?” Or maybe someone just has a feeling that using notebooks all day long is expensive and ends up being an unreliable way to produce Databricks Spark + Delta Lake pipelines that run well … without error.
It isn’t really that hard, and it revolves around a few core ideas:
- You must have a good development lifecycle:
  - Local development
  - Local testing
  - Deploy to the development environment (via CI/CD)
  - Testing
  - Deploy to the production environment (via CI/CD)
- You need to use Docker and Docker Compose:
  - With Spark and Delta Lake installed, plus whatever else you need (see the Dockerfile sketch below).
  - Run code locally and unit test locally (see the pytest sketch below).
- You need to invest in CI/CD, automated testing, and automated deployments:
  - Nothing should be done manually; the entire process should be automated.
  - Learn bash and tools like CircleCI or GitHub Actions (see the workflow sketch below).
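To make the Docker point concrete, here is a minimal sketch of what the local environment can look like. The base image, the pinned Spark/Delta versions, the service name, and the tests/ folder layout are all my assumptions, not something from the video; pin versions that match the Databricks Runtime you actually deploy to.

```dockerfile
# ---- Dockerfile (sketch) ----
FROM python:3.10-slim

# PySpark needs a JVM, even for local runs
RUN apt-get update \
    && apt-get install -y --no-install-recommends default-jre \
    && rm -rf /var/lib/apt/lists/*

# Spark + Delta Lake + test tooling, pinned to compatible versions
RUN pip install --no-cache-dir pyspark==3.5.1 delta-spark==3.2.0 pytest

WORKDIR /app
COPY . /app

# ---- docker-compose.yml (sketch) ----
# services:
#   pipeline-dev:
#     build: .
#     volumes:
#       - .:/app          # mount the repo so edits show up without rebuilding
#     command: pytest tests/
```

With that in place, `docker compose run pipeline-dev` runs the whole test suite in a clean container, which is the same thing CI will do later.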
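And this is the kind of unit test that runs inside that container. The transform (add_order_totals), the column names, and the temp-path layout are made up for illustration; the point is that a local SparkSession with Delta Lake enabled exercises the same code paths your job runs on Databricks.

```python
# tests/test_transform.py -- a minimal sketch of a local unit test
import pytest
from delta import configure_spark_with_delta_pip
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def add_order_totals(df: DataFrame) -> DataFrame:
    """Example pipeline step: total = quantity * unit_price."""
    return df.withColumn("total", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession with the Delta Lake extension and catalog enabled;
    # configure_spark_with_delta_pip pulls in the matching Delta jars.
    builder = (
        SparkSession.builder.master("local[2]")
        .appName("unit-tests")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()
    yield spark
    spark.stop()


def test_add_order_totals(spark, tmp_path):
    df = spark.createDataFrame(
        [("a", 2, 5.0), ("b", 3, 1.5)], ["order_id", "quantity", "unit_price"]
    )
    result = add_order_totals(df)

    # Round-trip through a local Delta table to exercise the Delta write/read path.
    target = str(tmp_path / "orders")
    result.write.format("delta").mode("overwrite").save(target)
    loaded = spark.read.format("delta").load(target)

    totals = {row["order_id"]: row["total"] for row in loaded.collect()}
    assert totals == {"a": 10.0, "b": 4.5}
```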
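For the CI/CD piece, a GitHub Actions workflow along these lines is one way to wire it up. This is only a sketch: the job names, Python and package versions, and the deploy step are assumptions, and the actual deployment command depends on whether you use the Databricks CLI, Asset Bundles, or your own scripts.

```yaml
# .github/workflows/ci.yml (sketch)
name: ci

on:
  pull_request:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: "17"
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install pyspark==3.5.1 delta-spark==3.2.0 pytest
      - run: pytest tests/

  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Ship the tested code to the Databricks workspace here, e.g. with the
      # Databricks CLI or Asset Bundles; the exact command is project-specific.
      - run: echo "your deploy command goes here"
```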
Watch the video below for the full details.
Hey great video!
What’s your opinion on running notebook tests against production tables (using less rows) as part of the CI?
Best practices! Same ideas here in this template project, but building a Python env instead of Docker… https://github.com/andre-salvati/databricks-template
Great video. When will you publish the article mentioned in the video?
It will be released on The Seattle Data Guy’s Substack newsletter soon: https://seattledataguy.substack.com/. I do many collabs with him; he recently published my Unity Catalog piece, and this other one should be coming soon.