A Tale of Betrayal and Heartbreak – Databricks Workflows and Jobs.
Nothing captures the imagination and heart like a tale of betrayal and heartbreak, and that is the tale I bring to you today. It’s a tale of Databricks Workflows and Jobs, version changes, new features, APIs, and insidious little hidden gems that will make you pull your hair out when you find them. It’s a tale of what not to do, a tale of how to put developer and customer experience first instead of forcing unwanted solutions down the throats of the little birdies feeding at your nest.
As a Data Engineer, simplicity and ease of use are things close to my heart, things that Databricks did well, or maybe I should say used to do well … before recent releases like the Jobs 2.1 API.
I hope you can hear the bitterness oozing from my words.
Once Upon a Time. Setting the Stage.
Let us set the stage for this grand tale of betrayal and heartbreak. There once was a Databricks who made wonderful tools that reduced the complexity of ML and Big Data pipelines in amazing ways, much to the chagrin of EMR and other such silly tooling. The ease with which one could spin up massive Spark clusters, crunch data, and use Delta Lake was truly a dream come true.
There were such wonderful APIs at every turn. APIs for everything under the sun you would want to do with Databricks. But, there was a problem. Databricks can be expensive. Spark costs money to run. Naughty people liked using Notebooks and All-Purpose-Clusters too much.
But, there was a saving grace, Databricks Jobs. These nice little buggers were half the cost of their icky cousin Mr. All-Purpose Cluster and his mean little child Naughty Notebook.
This gave rise to two different types of users, architectures, and interactions on Databricks, each always fighting the other and battling for power.
In one corner you had the folks that don’t care about money. It’s free money. It’s not my money. Spend, spend, spend. Notebooks, upon Notebooks, upon Notebooks. All-Purpose-Clusters running amuck, churning and burning data. Development on a cluster with a few TBs of RAM? Why not? It makes the code run fast. Shouldn’t you simply develop at scale?
And then of course, in the other corner, you had the Jobs APIs that enabled cheap, quick, programmatic running of Spark Jobs. This is the type of development folks are supposed to be doing: writing well-formed and tested Spark code, having good orchestration and submission of that code to Databricks, and monitoring it.
But, this created a conundrum and identity crisis for Databricks. Which one should they support, and where should the development time go? The crazy All-Purpose-Cluster and Notebook world lines the pockets like nothing else. That’s where the money is spent.
The Betrayal.
Do I really need to tell you which one they picked? Haven’t you figured it out already??!! Can’t you feel my anguish through these digital pages?? Like most stories of betrayal, it started slowly creeping in, not all at once. And it started with two things.
- Databricks Workflows
- Jobs 2.1 API
They made their choice. Enter Databricks Workflows and the Jobs 2.1 API. And there it was, the inevitable betrayal of all those trying to do the right thing. Trying to develop well, save money, and keep simplicity at the forefront. All out the window.
How, you ask? How have we all been betrayed, thrown upon the pyre, and sacrificed to the Notebook Gods and their teeming masses? I will tell you how.
My old friend was once called JobsRunSubmit. It even had a most popular cousin, an Airflow Operator that called this very API.
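(For the curious, the Airflow side of that pattern looked roughly like the sketch below. This assumes the DatabricksSubmitRunOperator from the apache-airflow-providers-databricks package; the cluster spec, file path, and connection id are placeholders of my own, not anything official.)

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

# One-time run submitted from Airflow onto a brand-new transient job cluster.
# Cluster spec and file path are placeholders; "databricks_default" is the
# provider's stock connection id.
with DAG(
    dag_id="transient_spark_job",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:
    submit_run = DatabricksSubmitRunOperator(
        task_id="submit_transient_spark_job",
        databricks_conn_id="databricks_default",
        json={
            "run_name": "transient-etl-run",
            "new_cluster": {
                "spark_version": "7.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "spark_python_task": {
                "python_file": "dbfs:/jobs/my_pipeline/main.py",
            },
        },
    )
```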
“Submit a one-time run. This endpoint allows you to submit a workload directly without creating a job.”
– Databricks
This little blighter was most useful. Submit a Spark Job that would run instantly on a new transient Job cluster (one that would disappear after the job run). So easy to use and orchestrate, so cheap and wonderful.
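What did that look like in practice? Something like the following; a minimal sketch against the 2.0 runs/submit endpoint, where the host, token, cluster spec, and file path are all placeholders:

```python
import requests

# A minimal sketch of a one-time transient run via the Jobs Runs Submit API.
# Endpoint and payload follow the 2.0 runs/submit shape; host, token, and
# paths below are placeholders.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

payload = {
    "run_name": "transient-etl-run",
    "new_cluster": {                      # transient job cluster, created for this run only
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2,
    },
    "spark_python_task": {                # point at your packaged Spark code
        "python_file": "dbfs:/jobs/my_pipeline/main.py",
        "parameters": ["--run-date", "2021-06-01"],
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/jobs/runs/submit",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())  # {"run_id": ...}; poll runs/get with this id for status
```

One POST, one fresh job cluster for the run, and the cluster vanishes when the run finishes. That was the simplicity.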
That most wicked Workflow came and ruined everything.
Of course, Databricks, wanting everyone to stop using horrible tools (eye roll) like Airflow, introduced those Workflows and with them the Jobs 2.1 API. With nice new features like multi-task Jobs and cluster reuse!!! Think of the wonderful cost savings of submitting a transient Spark job with multiple tasks/steps to all run on the same Cluster! Oh, a wonder to behold.
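On paper the new shape is lovely. Here is a rough sketch of what a multi-task, cluster-reuse job looks like as a 2.1 jobs/create payload; the task keys, file paths, and cluster sizing are made up for illustration:

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Jobs 2.1 multi-task job definition: two tasks sharing one job cluster
# via job_cluster_key. All names and paths here are illustrative only.
job_definition = {
    "name": "multi-task-pipeline",
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "9.1.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "extract",
            "job_cluster_key": "shared_cluster",
            "spark_python_task": {"python_file": "dbfs:/jobs/extract.py"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "extract"}],
            "job_cluster_key": "shared_cluster",
            "spark_python_task": {"python_file": "dbfs:/jobs/transform.py"},
        },
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_definition,
)
resp.raise_for_status()
print(resp.json())  # {"job_id": ...}; note this creates a persistent job, not a one-time run
```

Two tasks, one shared job cluster wired up with job_cluster_key and depends_on. On paper, exactly the cost savings described above.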
Our hearts were singing, there was dancing and merriment in the halls above. Until we realized it was the cruelest and most evil trick, as can be found in the following documentation and API responses.
“Transient single spark run submit jobs are not supported for multi-task format.”
– Databricks
“Spark submit is not supported on shared job clusters.”
– Databricks
Long story short, they are forcing us, ever so slightly but surely, toward using persistent Jobs and Workflows. They are also making it harder, way more complicated, or almost impossible to keep simplicity at the forefront with simple transient API Jobs.
Fine! Want to use your old wonderful simple way with an existing All-Purpose-Job-Cluster and spend more money because you’re forcing us to??
b'{"error_code":"INVALID_PARAMETER_VALUE","message":"Spark submit task is not allowed on existing cluster."
Well, you little hobbits. What if you have large Python zip files of custom code you want to easily send to the cluster?? Make me do a .whl or .jar file, will you??!!
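If you do give in, the wheel route they are nudging everyone toward looks roughly like this. A sketch only: it assumes a standard setuptools project, the databricks CLI for the upload, and a 2.1-style task that attaches the wheel as a library; every name and path below is made up.

```python
# Build and upload the wheel first (assumes a standard setuptools project
# and the databricks CLI; all names and paths are placeholders):
#
#   python setup.py bdist_wheel
#   databricks fs cp dist/my_custom_code-0.1.0-py3-none-any.whl dbfs:/libs/

# Then attach it to the task as a library instead of shipping a zip of py-files.
task = {
    "task_key": "run_pipeline",
    "job_cluster_key": "shared_cluster",
    "libraries": [{"whl": "dbfs:/libs/my_custom_code-0.1.0-py3-none-any.whl"}],
    "spark_python_task": {
        # thin entrypoint that imports from the installed my_custom_code package
        "python_file": "dbfs:/jobs/entrypoint.py",
    },
}
```

Perfectly doable, just one more packaging-and-upload hoop than tossing a zip at spark-submit ever was.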
Can I work around this? Sure. Do I have time to? No. Are they making it extremely difficult from multiple angles, adding a ton of complexity to using the wonderful new features in the cost-effective and easy manner we had before? Yes. Have they made it clear that the future of Databricks is focused on spending money with All-Purpose-Clusters, Workflows, Notebooks, and other such “come do everything with us and spend all your money on our wares” offerings? Yes.
Is it the end of the Databricks honeymoon?
Yes.
At my last job, I wrote a Python library that you can call from your notebook to run that notebook as a Job using the 2.1 API. It will clone whatever active All-Purpose cluster the notebook is attached to, or you can pass your own config. I don’t work there anymore, so I no longer maintain it, but it should still work.
https://github.com/gardnmi/all_purpose_bypass