The Abstractions Are Making You Dumb (Rise of the Shallow Expert)
When I was young and full of myself, writing Perl and PHP, while your ma was still reading you a bedtime story and giving you a stuffy to fall asleep with, I had to program uphill, both ways, in the rain and snow. Not like you milquetoast Data Engineers clickety-clicking around Databricks and Snowflake UIs.
You want a server? Spin up your own Apache. Need a database? MySQL was the only game in town. Need a backend language? Perl was the cat’s meow.
Need some Business Intelligence and Analytics? Go ahead, set up your own servers and install SAP yourself, and make sure to check those firewalls and open ports first. Those were the days when stored procedures reigned supreme and Microsoft Access was the go-to tool for the business. You’d better have known how to write some VBA back in those days; its nasty fingers were everywhere.
I still remember those days when the C# devs walked around with their noses in the air, whispering to each other about MVC. Then there were the Database Developers, only to be outdone by the Database Administrator, some old coot who had more Microsoft Certifications than an Eagle Scout has badges and was just as proud of them.
The Rise of Modern Data Engineering and Abstraction
Of course, I’m speaking a little tongue in cheek, just a little, but there is some truth rattling around in the bottom of the can somewhere.
Every single new cohort of Software Engineers, Programmers, and Hackers released into this world builds upon and relies on the ancestors who came before them, who built the tools they use, upon which they build ever newer tools.
It’s the way of things, yet now that I have become that senile, grumpy, and irrelevant old coder, I feel the same pangs and twinges of bitterness toward all those bright-eyed newcomers who proverbially skip their way up and down the Data Stack with zero appreciation of what came before.
I mean … all they do is open a Notebook on Databricks, crunch 50TBs of data with Python or SQL, yawn, and go back to their Snapchat and Instagram accounts.
And I wring my hands in desperation: don’t they know what we used to go through? Could they even take a set of bare-metal servers and configure an honest-to-goodness Spark Cluster with all the configuration that entails? Those little hobbits couldn’t do it, I tell you.
I fear the abstractions have made us vulnerable.
Things have changed so much over the years that new Data Engineers simply never encounter the problems the old ones did, and sometimes it shows. When things break and someone has to debug and fix them, that’s when the rocks start to come out of the water.
When you only know how to write SQL on Snowflake and that’s it … but then something breaks … something real … are you lost? I have seen that the struggle is real for some … to do simple and basic things like …
- SSH into a remote machine
- Do basic BASH things
- Understand IPs/Ports and Networking generally (see the sketch after this list)
- Database Fundamentals and Data Modeling
- How Distributed Systems Work
- Better Than Average Programming
- Understanding ALL Cloud Offerings (things like Lambda, etc.)
- End-to-end and Integration Testing
- Docker
- CI/CD and DevOps Automation
- Talking to People
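If “basic BASH things” and “IPs/Ports” sound abstract, here is the flavor of poking around I mean. A minimal sketch; the hostnames and ports are made up, but ss, nc, and curl are standard Linux tooling:

```bash
# Which processes are listening on which ports on this box?
ss -tlnp

# Can we actually reach the warehouse on its port? (host and port are hypothetical)
nc -zv warehouse.internal.example.com 5432

# Hit an HTTP health endpoint and report the status code and timing
curl -s -o /dev/null -w "%{http_code} in %{time_total}s\n" \
  http://api.internal.example.com:8080/health
```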
I mean, we could keep going forever, but what I see now in the industry is a lot of what I would call Shallow Experts. They might know a lot on the surface about a particular service … BigQuery, Redshift, Snowflake, or Databricks … but they stay in their lane; how they do it is all they know.
They don’t work on the command line, and they have trouble solving performance problems at scale because they lack knowledge of “real” Big Data topics like data partitioning, skew, and cluster configuration. They typically don’t know what good, clean Data Engineering code looks like, have no tests, and have no idea how to automate deployments.
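To be concrete about those “real” Big Data topics: even at submit time, a skewed Spark job exposes knobs a Shallow Expert never touches. A minimal sketch, assuming Spark 3.x (where adaptive query execution and skew-join handling are built in); the script name and master URL are placeholders:

```bash
# Hypothetical job and master URL; the --conf flags are standard Spark 3.x settings.
# spark.sql.shuffle.partitions        : right-size shuffle parallelism for the data volume
# spark.sql.adaptive.enabled          : let AQE re-plan stages at runtime
# spark.sql.adaptive.skewJoin.enabled : split skewed partitions during joins
spark-submit \
  --master spark://my-master:7077 \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.adaptive.skewJoin.enabled=true \
  my_pipeline.py
```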
If they had to generate keys, set up a CLI, log in to a remote machine, and check some logs … something like that … it would be so far outside the skillset of a Shallow Expert that they simply could not do it without assistance … yet they might hold the title of Senior Data Engineer.
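For the record, that whole workflow is a handful of commands. A rough sketch, with made-up host and service names (ssh-keygen, ssh-copy-id, and journalctl are standard tools):

```bash
# Generate a modern keypair (file path and comment are arbitrary)
ssh-keygen -t ed25519 -f ~/.ssh/id_pipeline -C "pipeline-debug"

# Install the public key on the remote box (hostname is hypothetical)
ssh-copy-id -i ~/.ssh/id_pipeline.pub deploy@etl-worker-01.example.com

# Log in and scan the last 100 lines of a (hypothetical) service's logs for errors
ssh -i ~/.ssh/id_pipeline deploy@etl-worker-01.example.com \
  "journalctl -u airflow-scheduler -n 100 --no-pager | grep -i error"
```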
Maybe all we need is Shallow Experts?
This could be a valid argument for some teams. Maybe they have large teams of Platform Engineers and Software Engineers who set up the Data Platforms, are in charge of the DevOps stuff, etc., so all they need is mediocre Shallow Experts to write some dbt pipelines with a little raw SQL sprinkled in.
That’s it.
The problem is you get what you pay for in many senses. At some point, a room full of Shallow Experts will catch up with you. Even beautiful and wonderfully abstracted platforms like Databricks and Snowflake cannot hide short-sighted and bad decisions that build up over long periods of time.
You can’t hide behind a tool forever.
That’s my rant for the day. So I challenge you, my friend: are you a Shallow Expert? Do something about it. Do something hard. Learn something you don’t know. Peel back the onion layers of your favorite open-source project … how was it written, how did they do it?
How do distributed systems work? Do you think you could set up a Spark Cluster … even inside Docker? Do you know how to use Docker? Can you move around the Linux command line? What is a Cloud Service you have never used before? Use it, read about it.
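And yes, a toy Spark cluster inside Docker is an afternoon project, not a moonshot. A sketch assuming the bitnami/spark image; the SPARK_MODE/SPARK_MASTER_URL environment variables and the examples-jar path follow that image’s conventions, so verify them against its docs:

```bash
# One master, one worker, on a shared Docker network
docker network create spark-net

docker run -d --name spark-master --network spark-net \
  -p 8080:8080 -p 7077:7077 \
  -e SPARK_MODE=master \
  bitnami/spark:3.5

docker run -d --name spark-worker --network spark-net \
  -e SPARK_MODE=worker \
  -e SPARK_MASTER_URL=spark://spark-master:7077 \
  bitnami/spark:3.5

# Smoke-test the cluster with the bundled SparkPi example
docker exec spark-master bash -c \
  "spark-submit --master spark://spark-master:7077 \
     --class org.apache.spark.examples.SparkPi \
     /opt/bitnami/spark/examples/jars/spark-examples_*.jar 100"
```

The master’s web UI lands on localhost:8080, and from there it’s the same Spark you’d run on bare metal, minus the racking and cabling.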
Write SQL all day long? Write some code. Write code all day long? Write some SQL.
Enjoyed the rant! I have not been doing this job long, but my fear is that I will become a Shallow Expert; luckily, my curiosity doesn’t know many bounds, and I have already started to investigate some of the topics you have brought up. I’m constantly looking to understand the atomic elements that our systems run on and are supported by, despite friends and colleagues telling me it’s unnecessary.
Outside of investigating, practicing, and jumping into different skill sets, the answer is probably no … but is there a recommended path to understanding these elements? Python is my first language, and I’ve started learning C to walk through building an OS, but I’m curious, from your experience, whether there’s a “proper” path from base to abstraction … like Richard Feynman’s lectures, but for CS/Data?
Either way appreciate your perspective!