Data Engineering Archives - Page 12 of 23 - Confessions of a Data Guy

8 Data Engineering Best Practices

Best practices are always a touchy subject. I’m going to forget someone’s pet best practice, I can already feel it. I’ve always been a firm believer in the basics, keeping things simple. I also subscribe to the 80/20 rule, and I don’t think Data Engineering is any different in that respect. Learning to do a few things well will, in the long run, probably solve most of the major problems you encounter in data teams and architectures. Today I want to give you 8 Data Engineering best practices to hopefully give you some food for thought at least.

September 21, 2022

Big Data, Data, Data Engineering, Data Quality, Data Warehousing, Python

Soda-Core. Data Quality at Scale.

Ever since playing with Great Expectations with Spark some time ago, I’ve been on the lookout for more Data Quality at-scale tools. The market still has a long way to go with these tools, not enough options, hard to use, and the typical Data Engineering travails. I came across soda-core recently, a self-proclaimed…

“Data reliability testing for SQL- and Spark-accessible data.”
soda-core docs

Doing anything at scale, well … that’s usually the problem. Data Quality and Observability are topics we hear a lot about these days. The reality often doesn’t meet the expectations. Even Great Expectations, being awesome, can get complicated real quick-like. Let’s hope that soda-core paired with Spark can show us some real promise. Code available on GitHub.

September 14, 2022

Big Data, Data, Data Engineering, Data Warehousing, Rust

DataFusion courtesy of Rust, vs Spark. Performance and other thoughts.

I think it’s funny that DataFrames are so popular these days, I mean for good reason. They are a wonderful and intuitive way to work with and on datasets. Pandas … the nemesis of all Data Engineers and the lover of Data Scientists. Apache Spark is really the beast that brought DataFrames to the masses. Even those little buggers over at Apache Beam give you DataFrames.

Of course, when anything gets popular, you start getting little things that start to pick and peck at the heels. I would probably say that is what DataFusion with Rust seems to be. Seems more like a contender against Pandas rather than Spark to me. I guess if you’re just using Spark locally or on a single node, sure you could consider using DataFusion. Code available on GitHub.

September 11, 2022

Data, Data Engineering, Golang, Python, Rust

Working with Cloud Storage (s3). Golang vs Rust vs Python. Who shall emerge victorious?

I’ve always been a firm believer in using the right tool for the job. Sometimes I look at a piece of code … and ask … why? I mean just because you can do something doesn’t mean that you should. I see a lot of my job as someone who writes code … as not just my ability to write code, but the ability to reason about problems and design simple and elegant solutions that solve the problem at hand.

I try not to let my love of a tool, language, or package color my view of the world as it is. In fact, there is wisdom to be found in being critical of those languages and tools you love the most. Be aware of their shortcomings and failures. This leads to better software and architecture designs, and less complexity. Too often I’ve seen folks picking their tool of choice and then sticking with it till the bitter end, and it usually is bitter. There is more to life than writing obtuse Scala code that is illegible for some mundane task.

This sort of thing is a blight on everyone and every system. Now I must descend from my high horse and join the peasants on the dusty road of life. Today I want to look at some very common Data Engineering tasks, namely cloud storage, and what it is like to do such a thing with Golang, Rust, and Python. I will let you draw your own conclusions. Maybe. Code available on GitHub.

September 10, 2022

Big Data, Data, Data Engineering, Ramblings

Real-Life Example of Big O(n) Notation (and other such nonsense) for Data Engineering.

In the beginning, I always thought the humdrum Big O Notation discussions should be reserved for Software Engineers who enjoyed working on such things. I mean, what could it possibly have to do with Data Engineering? I mean, if you are the person writing the Spark application, by all means, have at it, but if you are the Data Engineer who is simply using Spark, why can’t you just leave the details to the Devil? Seems to make sense.

The only problem with that logic is that the longer you work as a Data Engineer, the harder the problems you work on probably become. You write more and more code, and basically end up being a specialized Software Engineer … even if you don’t want to be. In the end, to be a good Data Engineer you should at least attempt to understand the concepts behind Big O Notation, and how those concepts can apply to you as a Data Engineer, especially for the ETL that most of us write.
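To make that a little more concrete, here is a minimal sketch of the kind of thing that bites in everyday ETL code. It is not taken from the post itself; the function names and the sample records are made up for illustration. Both functions drop duplicate records while keeping the first occurrence, but one checks membership against a list (a linear scan every time, so roughly O(n²) overall) and the other against a set (average constant-time lookups, so roughly O(n) overall).

```python
def dedupe_with_list(records):
    """Drop duplicates, keeping first occurrences. Roughly O(n^2)."""
    seen = []  # hypothetical helper state: a plain list
    out = []
    for record in records:
        # "in" on a list scans every element seen so far
        if record not in seen:
            seen.append(record)
            out.append(record)
    return out


def dedupe_with_set(records):
    """Same result, roughly O(n): set lookups are constant time on average."""
    seen = set()
    out = []
    for record in records:
        # "in" on a set is a hash lookup, not a scan
        if record not in seen:
            seen.add(record)
            out.append(record)
    return out


if __name__ == "__main__":
    sample = [3, 1, 3, 2, 1]  # made-up record ids
    print(dedupe_with_list(sample))
    print(dedupe_with_set(sample))
```

On a handful of records you will never notice the difference. On a few million rows pulled out of a file, the list version is likely the one that makes the pipeline crawl.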

August 15, 2022

Data, Data Engineering, Data Warehousing

5 things I wish I knew about Databricks … before I started.

How many times in your life, that is but a mist, have you thought, “If I had only known that in the beginning?” I feel as if I’ve committed that cardinal sin as a developer and Data Engineer … falling in love with a tool to the exclusion of all else. I mean truly, Databricks has brought Big Data to the masses. All you need is your laptop and 10 minutes of PySpark training before you’re spending gobs of money, processing massive amounts of data. Where else, and with what else can you do such things? Try it with EMR, good luck to you.

That being said, when you love something you start to notice the slight imperfections and problems with that something. You get kinda nit-picky. Such is life. I want to save some poor soul out there some heartache, that moment when you’ve been writing code for hours or days, and come upon a little surprise that makes your heart drop into your shoes, and the blood runs to your face. Here are 5 things I wish I knew about Databricks before I started. Maybe it will save you time, help you, who knows.

August 5, 2022

Big Data, Data, Data Engineering, Data Quality, SQL

You Only Need 2 Data Validations, That’s It.

I mean, I’m sort of being facetious and sort of not. There is some truth that rings out in those words. I’m sure someone selling Data Observability tools, or writing Great Expectations all day, will not like the idea of relying on only 2 data validations. But honestly, these two are probably two more than 80% of Data Teams are using today for validation, which is none. What are the 2? Glad you asked.

August 1, 2022

Data, Data Engineering, Golang, Ramblings, Rust

Thoughts on Saint Augustine, Rust vs Golang. Complexity, verbosity, and other matters.

**Image: *Saint Augustine of Hippo* | Line engraving by P. Cool after M. de Vos | Wellcome Images**

I’ve always enjoyed reading Mr. Augustine of Hippo, particularly “Confessions.” Ahead of his time in many ways. Although, you have to be into that sort of thing to find such topics interesting. It can be sort of dry, drawn out, verbose, and not for the faint of heart. Much like learning new programming languages. I’ve been messing with Golang off and on and here and there. Recently I added Rust to that list, more out of curiosity and to see what’s new in the world.

I’ve spent a lot of time thinking about the theology of programming in the space of Data Engineering. It’s such a wide area that encompasses so many different skills, Data Engineering that is. Why do we do what we do, write what we write? Like Augustine I see both old and new all around me, some things change, but many things stay the same.

People find hills like Python, Scala, Golang, Rust, and then promptly decide to die on them. I enjoy different things simply because of the way they teach you things about yourself and the world.

July 15, 2022

Big Data, Data, Data Engineering, Data Warehousing

Exploring Delta Lake’s ZORDER, and Performance. On Databricks.

I think Delta Lake is here to stay. With the recent news that Databricks is open-sourcing the full feature-set of Delta Lake, instead of keeping the best stuff for themselves, it probably has the most potential to be the number one go-to for the future of Data Lakes, especially within those organizations that are heavy Spark users.

One of the best parts about Delta Lake is that it’s easy to use, yet it has a rich feature set, making it a powerful option for Big Data storage and modeling. One of those features that promise a lot of performance benefits is something called ZORDER. Today I want to explore more in-depth what ZORDER is, when to use it, when not to use it, and most importantly test its performance during a number of common Spark operations.

July 7, 2022

Data, Data Engineering, Golang

Thoughts on HTTP and JSON with Golang. And other Headaches.

I’ve been playing with Golang off and on for a few weeks, when I find the time, which is every few weeks between kids and fishing. I have become a little bit of a fan, wishing for more projects to take on with Go. It seems like a fairly straightforward language to pick up, the learning curve isn’t that bad, and it’s fast and powerful. I’ve found it a little more intuitive than Scala for example. I mean don’t get me wrong, nothing will take the place of Python in my life, but there’s always room for one more.

That being said, “But I have this against you…” when it comes to Go, and it has to do with JSON. All code is on GitHub.

July 5, 2022
