Data Engineering Archives - Page 16 of 23 - Confessions of a Data Guy

Big Data, Data, Data Engineering, Python, Ramblings

The Wild West of Parallel Computing – Review of Bodo.ai

It truly is the Wild West of parallel computing these days. Big data seems to have brought out an onslaught of companies either trying to make the existing big data platforms easier to use or building their own. Most of them take shots at tools like Spark and Dask, probably two of the better-known big data engines. Of course, with Python’s rise, especially in Data Science and ML, many of these tools target that audience.

One such newcomer is Bodo.ai, and I’ve seen them pop up on places like r/dataengineering. Fortunately, they have a free community edition, so let’s kick the tires and see what’s going on.

September 24, 2021

Big Data, Data, Data Engineering, Machine Learning, Python

Dask vs PySpark – Performance and Other Thoughts.

Every once in a while I see someone talking about their wonderful distributed cluster of Dask machines, and my curiosity gets aroused. I know plenty of people use Dask, mostly on their local machines, but it seems that with the meteoric rise of Spark, especially with tools like EMR and Databricks, Dask is slowly slipping into the shadows. I’ve had bad experiences with Dask in the past, trying to get it to work well in production. I suppose that comes from working with tried and true Spark and other bulletproof distributed systems. I’ve been meaning to return to Dask for a while and compare similar Dask and Spark clusters on performance … and other things like ease of setup and writing code. Let’s get to it.

September 6, 2021

Data, Data Engineering, Ramblings, Uncategorized

Why I both Love and Hate LeetCode

There are a few things in life I both love and hate. Let’s see …. hot weather, cold weather, working for a living, and …. LeetCode. I mean it is totally fun to push yourself and try to solve hard problems, but then the other side of me is like … well I’ve been writing code for years and 80% of this stuff is nothing like writing code in real life. I think the LeetCode platform itself is an amazing tool, and has provided both people and companies with an elegant way to showcase and practice skills. But is there too much of a good thing? Of course.

August 26, 2021

Big Data, Data, Data Engineering, Data Warehousing, Ramblings

Databricks vs AWS EMR – Theory and Real Life.

I saw a recent post on r/dataengineering, a question centered around why Databricks is so popular when tools like EMR have been floating around for so long. It got me thinking. It really isn’t all about the technical side and offerings, although those do play a large role. There are always proponents for every technology, old or new … like our favorite band or sports team, we fight to the death for what we love and cherish. I want to talk theoretically and technically about Databricks and EMR, and why you should use Databricks. 🙂

August 19, 2021

Big Data, Data, Data Engineering, Python

“Don’t mess with the dials,” they said. Spark (PySpark) Shuffle Partition Configuration and Performance.

Sometimes I amaze myself. I’ve been using PySpark for a few years now, happily crunching hundreds of TBs of data without much problem. Sure, you randomly run into OOM errors and other such nonsense. Usually inspecting the code for something silly, or throwing in a persist() or cache() here and there, will solve 99% of the problems. I’ve always taken an overly pragmatic approach to Spark performance. Spark being the beast that it is, it’s easy to hide performance problems with more resources, etc. I’ve generally tried to stay away from UDFs, just using good coding practices and out-of-the-box functionality. Ensuring good predicate pushdowns, data partitioning, etc. are all helpful and important. But in the end … I don’t really know much about the out-of-the-box Spark configurations and how they affect performance.

Do the configurations change based on data size and partitioning strategy plus resources and cluster size? Probably. Does that seem complicated to figure out? Yes. Is the internet full of conflicting, vague and confusing advice? Of course.

August 13, 2021

Big Data, Data, Data Engineering, Ramblings

Build your Data Engineering skills with Open Source Data

There is many a day when I find myself scrolling through the r/dataengineering subreddit; it’s a fun place to stalk. Lots of people with lots of opinions make for interesting times. I see one question, or a variation of it, come up over and over again. How do I learn data engineering skills, how do I get into data engineering, what kind of problems do data engineers solve, blah, blah, blah? It’s a great question, and one without an easy answer. Well … there is an answer, but it takes some time and willpower to get it done. Open source data. This is the way. Read books, take classes, do whatever; it’s hard to really learn the skills needed day-to-day as a data engineer without actually doing the work. But how do you do the work without the work? Make up your own work, I say.

August 1, 2021

Big Data, Data, Data Engineering, Ramblings

5 Basic and Undervalued Data Engineering Skills

What is the standard for most data engineers these days? Turns out SQL and Python are still running the show pretty much across the board. There’s always a variety of skill levels in those areas, some better, some worse, although with a little work and repetition it’s pretty easy to master both SQL and Python. I’ve found that having good development skills, whether in Python and SQL … or Java … or Scala, is really only half the battle. It seems there are always a few basic data engineering skills that come up over and over. They are simple skills, foundational skills that allow an average data engineer to be better. They make a person more versatile and able to solve more complex problems and work across a wide variety of tech stacks and cloud providers. What are they? Read on my fair weathered friend.

July 26, 2021

Big Data, Data, Data Engineering, Python, Scala

String Slicing Performance – Python vs Scala vs Spark.

Good ole’ string slicing. That’s one thing that never changes in Data Engineering, working with strings. You would think we would all get to grow up some day and do the complicated stuff, but apparently you can’t outrun your past. I blame this mostly on the data and old-school companies. Plain text and flat files are still incredibly popular and common for storing and exporting data between systems. Hence string work comes upon us all like some terrible overlord. The one you should fear the most is fixed-width delimited files. I ran into a problem recently where PySpark was surprisingly terrible at processing fixed-width delimited files and “string slicing.” It got me wondering … is it me or you?
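For anyone who has not met a fixed-width file, here is a minimal sketch of what “string slicing” means in plain Python. The layout, field names, and sample record below are made up for illustration and are not taken from the post; a real file would come with its own column spec.

```python
# Hypothetical fixed-width layout: (field name, start offset, end offset).
# These names and offsets are assumptions for illustration only.
LAYOUT = [
    ("customer_id", 0, 6),
    ("name", 6, 16),
    ("amount", 16, 24),
]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of whitespace-stripped strings."""
    return {name: line[start:end].strip() for name, start, end in layout}


# A made-up 24-character record that matches the layout above.
record = "000042Jane Doe  00012.50"
print(parse_fixed_width(record))
```

Every field is just a slice by character position, which is why this kind of work tends to come down to raw string slicing speed.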

July 17, 2021

Big Data, Data, Data Engineering, Data Warehousing, SQL

Data Lake vs Data Warehouse – What’s the Dealio?

Data Lake, Data Warehouse, Lake House, Data Mart, it’s always something isn’t it? Don’t get me started on Data Mesh. Yikes, it’s hard to keep up these days. I want to explore the Data Lake vs the Data Warehouse and what it all really boils down to: what is the real difference? Is it data modeling, architecture, storage? I think there are a few different things that differentiate a Data Lake from a true Data Warehouse, let’s talk.

July 10, 2021

Big Data, Data, Data Engineering, Data Warehousing, Ramblings

Databricks vs Snowflake. The DataLake/Warehouse Battle.

As someone who worked around the classic Data Warehouses back in the day, before S3 took over, when SQL Server and Oracle ruled the day … I love sitting on the sidelines watching new … yet old battle-lines being re-drawn. I could probably scroll back in StackOverflow 12 years and find the same arguments and questions. In one sense Databricks and Snowflake are totally different tools … but are they? Distributed big data processing, apply transforms to data, enable Data Lake / Data Warehouse / Analytics at scale. There is a lot of bleed over between the two; it really comes down to what path you would like to take to get to the same goal.

June 28, 2021
