Every once in awhile I see someone talking about their wonder distributed cluster of Dask machines, and my curiosity gets aroused. I know plenty of people use Dask, mostly on their local machines, but it seems like the meteoric rise of Spark, especially with tools like EMR and Databricks, that Dask is slowly slipping into the shadows. I’ve had bad experiences with Dask in the past, trying to get it work well in production. I suppose that comes from working with tried and true Spark and other bullet proof distributed system. I’ve been meaning to return to Dask for awhile, compare a similar Dask and Spark cluster on performance … and other things like ease of setup and writing code. Let’s get too it.

Read more

I saw a recent post on r/datengineering, a question centered around why Databricks is so popular when tools like EMR have been floating around for so long. It got me thinking about it. It really isn’t all about the technical side and offerings, although that does play a large role. There are always proponents for every technology, old or new … like our favorite band or sports team, fight to the death for what we love and cherish. I want to talk theoretically, and technically about Databricks and EMR, and why you should use Databricks. 🙂

Read more

Sometimes I amaze myself. I’ve been using PySpark for a few years now, happily crunching hundreds of TBs of data without much problem. Sure you randomly run into OOM errors and other such nonsense. Usually inspecting the code for something silly, throwing in a persist() or cache() here and there will solve 99% of the problems. I’ve always approached Spark performance with an overly pragmatic approach. Spark being the beast that it is, it’s easy to hide performance problems with more resources etc. I’ve generally tried to stay away from UDF's just using good coding practices and out of the box functionality. Ensuring good predicate pushdown’s, data partitioning etc are all helpful and important. But in the end… I don’t really know much about the out-of-the-box Spark configurations and how they affect performance.

Do the configurations change based on data size and partitioning strategy plus resources and cluster size? Probably. Does that seem complicated to figure out? Yes. Is the internet full of conflicting, vague and confusing advice? Of course.

Read more

There are many a day when I find myself scrolling through the subreddit for r/dataengineerg, it’s a fun place to stalk. Lot’s of people with lots of opinions make for interesting times. I see one question or a variation of it come up over and over again. How do I learn data engineering skills, how do I get into data engineering, what kind of problems do data engineers solve, blah, blah, blah? It’s a great question, and one without an easy answer. Well … there is an answer but it takes some time and willpower to get it done. Open source data. This is the way. Read books, take classes, do whatever, it’s hard to really learn the skills needed day-to-day as a data engineer without actually doing the work. But how do you do the work without the work? Make up your own work I say.

Read more

What is the standard for most data engineers these days? Turns out SQL and Python are still running the show pretty much across the board. There’s always a variety of skills in those areas, some better, some worse, although with a little work and repetition it’s pretty easy to master both SQL and Python. I’ve found that Python and SQL … or Java … or Scala … having good development skills is really only half the battle. It seems there is always a few basic data engineering skills that come up over and over. They are simple skills, foundational skills that allow an average data engineer to be better. They make a person more versatile and able solve more complex problems and work across a wide variety of of tech stacks and cloud providers. What are they? Read on my fair weathered friend.

Read more

Good ole’ string slicing. That’s one thing that never changes in Data Engineering, working with strings. You would think we would all get to row up some day and do the complicated stuff, but apparently you can’t outrun your past. I blame this mostly on the data and old schools companies. Plain text and flat files are still incredibly popular and common for storing and exporting data between systems. Hence string work comes upon us all like some terrible overload. The one you should fear the most is fixed width delimited files. I ran into a problem recently where PySpark was surprisingly terrible at processing fixed with delimited files and “string slicing.” It got me wondering … is it me or you?

Read more

Data Lake, Data Warehouse, Lake House, Data Mart, it’s always something isn’t it? Don’t get me started on Data Mesh. Yikes, it’s hard to keep up these days. I want to explore the Data Lake vs the Data Warehouse and what it really all boils down to, what is the real difference. Is it data modeling, architecture, storage? I think their are a few different things that differentiate a Data Lake from a true Data Warehouse, let’s talk.

Read more

As someone who worked around the classic Data Warehouses back in the day, before s3 took over and SQL Server and Oracle ruled the day … I love sitting on the sidelines watching new … yet old battle-lines being re-drawn. I could probably scroll back in StackOverflow 12 years and find the same arguments and questions. In one sense Databricks and Snowflake are totally different tools … but are they? Distributed big data processing, apply transforms to data, enable Data Lake / Data Warehouse / Analytics at scale. There is a lot of bleed over between the two, it really comes down to what path you would like to take to get to the same goal.

Read more

The two coolest kids in class … I mean seriously … every other post in Data Engineering world these days is about Apache Airflow or DataBricks. It’s hard to kick against the goad. Just jump on the band wagon before you get left in the dust. I’ve used both DataBricks and Apache Airflow, they both are pretty important and integral tools for data engineers these days. Apache Airflow makes overall complex pipeline dependencies, orchestration, and management intuitive and easy. DataBricks has delivered with AWS and EMR could not, easy to use Spark and DeltaLake functionality without the management and config nightmares of running Spark yourself.

Recently I worked on an Airflow and DataBricks/DeltaLake integration, time to talk what it looks like and options when doing this type integration.

Read more

Apache Druid, kinda like that second cousin you know about … but don’t really know. When you see them for the first time in 10 years you kinda look at them out of the corner of your eye. That’s how I feel about Apache Druid, I’ve always known it has been there, lurking around in the shadows, but it rarely pokes it head out and I have no idea what, why, how it is used. Time to change that, for the better or worse. Let’s take 10,000 foot survey of Druid.

Read more