Developing Production-Level Databricks Pipelines
A question that comes up often … “How do I develop production-level Databricks pipelines?” Or maybe someone just has a feeling that using notebooks all day long is expensive and ends up being an unreliable way to produce Databricks Spark + Delta Lake pipelines that run well … without error.
It isn’t really that hard, and it revolves around a few core ideas:
- You must have a good development lifecycle:
  - Local development
  - Local testing
  - Deploy to the development environment (via CI/CD)
  - Testing
  - Deploy to the production environment (via CI/CD)
- You need to use Docker and Docker Compose:
  - With Spark and Delta Lake installed, plus whatever else you need (see the Dockerfile sketch below).
  - Run code locally and unit test locally (see the pytest sketch below).
- You need to invest in CI/CD, automated testing, and automated deployments:
  - Nothing should be done manually; the entire process should be automated.
  - Learn bash and tools like CircleCI or GitHub Actions (see the workflow sketch below).
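To make the Docker point concrete, here is a minimal sketch of what the local environment can look like. The base image, the pinned Spark/Delta versions, the service name, and the tests/ folder layout are all my assumptions, not something from the video; pin versions that match the Databricks Runtime you actually deploy to.

```dockerfile
# ---- Dockerfile (sketch) ----
FROM python:3.10-slim

# PySpark needs a JVM, even for local runs
RUN apt-get update \
    && apt-get install -y --no-install-recommends default-jre \
    && rm -rf /var/lib/apt/lists/*

# Spark + Delta Lake + test tooling, pinned to compatible versions
RUN pip install --no-cache-dir pyspark==3.5.1 delta-spark==3.2.0 pytest

WORKDIR /app
COPY . /app

# ---- docker-compose.yml (sketch) ----
# services:
#   pipeline-dev:
#     build: .
#     volumes:
#       - .:/app          # mount the repo so edits show up without rebuilding
#     command: pytest tests/
```

With that in place, `docker compose run pipeline-dev` runs the whole test suite in a clean container, which is the same thing CI will do later.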
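And this is the kind of unit test that runs inside that container. The transform (add_order_totals), the column names, and the temp-path layout are made up for illustration; the point is that a local SparkSession with Delta Lake enabled exercises the same code paths your job runs on Databricks.

```python
# tests/test_transform.py -- a minimal sketch of a local unit test
import pytest
from delta import configure_spark_with_delta_pip
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def add_order_totals(df: DataFrame) -> DataFrame:
    """Example pipeline step: total = quantity * unit_price."""
    return df.withColumn("total", F.col("quantity") * F.col("unit_price"))


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession with the Delta Lake extension and catalog enabled;
    # configure_spark_with_delta_pip pulls in the matching Delta jars.
    builder = (
        SparkSession.builder.master("local[2]")
        .appName("unit-tests")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    spark = configure_spark_with_delta_pip(builder).getOrCreate()
    yield spark
    spark.stop()


def test_add_order_totals(spark, tmp_path):
    df = spark.createDataFrame(
        [("a", 2, 5.0), ("b", 3, 1.5)], ["order_id", "quantity", "unit_price"]
    )
    result = add_order_totals(df)

    # Round-trip through a local Delta table to exercise the Delta write/read path.
    target = str(tmp_path / "orders")
    result.write.format("delta").mode("overwrite").save(target)
    loaded = spark.read.format("delta").load(target)

    totals = {row["order_id"]: row["total"] for row in loaded.collect()}
    assert totals == {"a": 10.0, "b": 4.5}
```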
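For the CI/CD piece, a GitHub Actions workflow along these lines is one way to wire it up. This is only a sketch: the job names, Python and package versions, and the deploy step are assumptions, and the actual deployment command depends on whether you use the Databricks CLI, Asset Bundles, or your own scripts.

```yaml
# .github/workflows/ci.yml (sketch)
name: ci

on:
  pull_request:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: "17"
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - run: pip install pyspark==3.5.1 delta-spark==3.2.0 pytest
      - run: pytest tests/

  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Ship the tested code to the Databricks workspace here, e.g. with the
      # Databricks CLI or Asset Bundles; the exact command is project-specific.
      - run: echo "your deploy command goes here"
```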
Watch the video below for the full details.
Hey great video!
What’s your opinion on running notebook tests against production tables (using less rows) as part of the CI?
Best practices! Same ideas here in this template project, but building a Python env instead of Docker… https://github.com/andre-salvati/databricks-template
Great video. When will you publish the article mentioned in the video?
It will be released on The Seattle Data Guy’s Substack newsletter soon: https://seattledataguy.substack.com/. I do many collabs with him; he recently published my Unity Catalog piece, and this other one should be coming soon.