Do this, not that. Well, I’ve got my own list. With everyone jumping on the PySpark / Databricks / EMR / Glue / Whatever bandwagon I thought it was long overdue for a post on what to do, and not to do when working with Spark / PySpark. I take the pragmatic approach to working with Spark, it’s honestly very forgiving well and far into the 10s of TBs of data. Once you wander past that point things tend to get a little spicy if you don’t have it all dialed in. As with most things in life if you get a few things right, and of course don’t do some things, that will get you a long way, the same applies to Spark.
Read moreEver since playing with Great Expectations with Spark some time ago, I’ve been on the lookout for more Data Quality at-scale tools. The market still has a long way to go with these tools, not enough options, hard to use, and the typical Data Engineering travails. I came across soda-core recently, a self-proclaimed…
“Data reliability testing for SQL- and Spark- accesssible data.“
soda-core docs
Doing anything at scale, well … that’s usually the problem. Data Quality and Observability are topics were hear a lot about these days. The reality often doesn’t meet the expectations most of the time. Even Great Expectations, being awesome, can get complicated real quick-like. Let’s hope that soda-core pair with Spark can show us some real promise. Code available on GitHub.
Read moreI’ve always been a firm believer in using the right tool for the job. Sometimes I look at a piece of code … and ask … why? I mean just because you can do something doesn’t mean that you should. I see a lot of my job as someone who writes code … as not just my ability to write code, but the ability to reason about problems and design simple and elegant solutions that solve the problem at hand.
I try not to let my love of a tool, language, or package color my view of the world as it is. In fact, there is wisdom to be found in being critical of those languages and tools you love the most. Be aware of their shortcomings and failures. This leads to better software and architecture designs, and less complexity. Too often I’ve seen folks picking their tool of choice and then sticking with it till the bitter end, and it usually is bitter. There is more to life than writing obtuse Scala code that is illegible for some mundane task.
This sort of thing is a blight on everyone and every system. Now I must descend from my high horse and join the peasants on the dusty road of life. Today I want to look at some very common Data Engineering tasks, namely cloud storage, and what it is like to do such a thing with Golang
, Rust
, and Python
. I will let you draw your own conclusions. Maybe. Code available on GitHub.
Just when I think it cannot get more popular, it does. I have to admit, PySpark is probably the best thing that ever happened to Big Data. It made what was once a myth, approachable to the average person. No need for esoteric Java skills, no more MapReduce, just plain old Python. Another amazing thing about Spark in general, and by extension PySpark, is the sheer amount of out-of-the-box capabilities. I wanted to dedicate this post to a few amazing and wonderful features of PySpark that make Data Engineering fun and powerful.
Read moreIt still seems like the wild west of Data Quality these days. Tools like Apache Deque are just too much for most folks, and Data Quality is still new enough to the scene as a serious thought topic that most tools haven’t matured that much, and companies dropping money on some tool is still a little suspect. I’ve probably heard more about Great Expectations as a DQ tool than most.
With the popularity of PySpark as a Big Data tool, and Great Expectations coming into its own, I’ve been meaning to dive into what it would actually look like to to use Great Expectations at scale and answer some simple questions. How easy is it to get up and running with Spark, what’s the path of least resistance to getting some basic Data Quality checks in place in a data pipeline.
Read moreProbably one of the hardest hurdles to jump over when starting out in anything new, including Data Engineering and Data Pipelines, is knowing where to start. It always can be a little daunting. One aspect that can make or break any project, giving you the confidence to move forward like Sparticus to conquer, is having a good project template for your repository of code and logic that will encapsulate and present your code to others.
I’ve created a free and hopefully helpful Python
blank GitHub project template that you can clone, change, and steal to your heart’s desire. I hope it will be helpful and set you going in the right direction for your next project.
Not going to lie, I do enjoy the vendor wars that this marketing craze called “The Modern Data Stack” has created. I like to keep just about everything in life at arm’s length. Kinda like the way you look at your crazy third cousin out of the corner of your eye at the family reunion. I mean it’s nice to have all these options to choose from these days when building data pipelines.
One tool I haven’t been able to poke the tires on yet is Prefect. It appears to be another data orchestration tool for Python, but we shall find out. I want this to be an introduction to Prefect, we shall just try it out and let the chips fall where they may.
Read moreAs the years drag by in Data Engineering, there are a few things that I have come to appreciate more and more. One of those topics that is close to number one on the list is complexity reduction. Today’s modern data stacks are filled to the brim with technologies and tools, full to the brim, and overflowing. So many tools with such wonderful features, sometimes all the magic comes with a downside. Complexity. Complexity can turn something wonderful into a nightmare.
Reducing (not avoiding) complexity seems to be one of the main tenets I work on these days when designing resilient, reliable, and repeatable data pipelines that can process terabytes of data. One of those tools is COPY INTO
feature of Databricks + Delta Lake.
Mmmm … Data Quality … it is a thing these days. I look forlornly back to the ancient days of SQL Server when nobody cared about such things. Alas, we live in a different world, where hundreds of terabytes of data are the norm, and Data Quality becomes a thing. I’ve been meaning to give Great Expectations a poke for like a year, but just haven’t had the time or inclination to do so, but times are changing, and so should I.
I’m not really planning on giving an in-depth guide to Data Quality with Great Expectations, what I’m more interested in are topics like, how easy is it to set up and use, what’s the overhead, what are the main features and concepts and are they easy to understand. I find this sort of review of Data Engineering tools to be more helpful than simply a regurgitation of the documentation.
Read moreIt’s not often I yearn for the good old days of SQL Server
, but I’ve had a few of those moments lately. Some things I miss, some I don’t, and it’s probably because I’m getting old and crusty, stuck in my ways, by permissioning
is one of those topics where I think about the good old days. Data access control and permissions are topics that we all kinda ignore as not that important … until we actually are trying to do something with them. Then all of sudden we start complaining about complexity and why this isn’t easier. That’s a good way to describe my reaction to having to work on Databricks Access control.
But, I learned a few things and I think they will be helpful for someone. Read on for the basics of handling permissions and access control in Databricks and Delta Lake.
Read moreInteresting links
Here are some interesting links for you! Enjoy your stay :)Pages
Categories
Archive
- January 2025
- December 2024
- November 2024
- October 2024
- September 2024
- August 2024
- July 2024
- June 2024
- May 2024
- April 2024
- March 2024
- February 2024
- January 2024
- December 2023
- November 2023
- October 2023
- September 2023
- August 2023
- July 2023
- June 2023
- May 2023
- April 2023
- March 2023
- February 2023
- January 2023
- December 2022
- November 2022
- October 2022
- September 2022
- August 2022
- July 2022
- June 2022
- May 2022
- April 2022
- March 2022
- February 2022
- January 2022
- December 2021
- November 2021
- October 2021
- September 2021
- August 2021
- July 2021
- June 2021
- May 2021
- April 2021
- March 2021
- February 2021
- January 2021
- December 2020
- November 2020
- October 2020
- September 2020
- August 2020
- July 2020
- June 2020
- May 2020
- April 2020
- March 2020
- January 2020
- December 2019
- November 2019
- October 2019
- September 2019
- August 2019
- July 2019
- May 2019
- March 2019
- February 2019
- January 2019
- December 2018
- November 2018
- October 2018
- September 2018
- July 2018
- June 2018
- May 2018
- April 2018
- March 2018
- February 2018