Top 10 Data Engineering Blogs
I’ve always been surprised, given the rise of data engineering and big data, how hard it is to find good data engineering content that is published somewhat regularly. Tech moves fast, and I feel like data engineering moves even faster. New tools and systems come out all the time, and it’s hard to keep up with what’s hot and what’s not. Still, I think it’s important to keep a finger on the pulse of which tech stacks are starting to take over (Spark) and which are fading into oblivion. So here is my top ten list of data engineering blogs. These are the places I frequent so I at least know what’s going on in the world of data engineering.
And without further fanfare….
1. Netflix Tech (Big Data tagged)
https://netflixtechblog.com/tagged/big-data
Now most people may not have much in common with the streaming juggernaut Netflix, but their tech blog, with articles tagged for Big Data, is an incredible resource on what tools and processes are out there that are being used at an incredible scale. I mostly use the blogs as a tool to understand what technology stacks they are using, and how they are using them. It gives me a good idea of what I should consider when taking on new Big Data projects myself.
Also, they have a lot of posts that really dig into the architecture and talk about the ideas behind the ideas. Incredibly useful for data engineers trying to perfect their craft and become thinkers as much as doers.
2. Airbnb Data Science and Data Platform Engineering
https://medium.com/airbnb-engineering/data/home
You gotta love Airbnb. Not only is their product amazing and wonderful, they’ve given the world Apache Airflow. I mean, they have to be awesome to have contributed one of the most amazing and important data engineering tools. I follow this content for many of the same reasons I follow the Netflix content: I want to know what tools are being used successfully at scale. But there is another reason I read this stuff.
Some of the posts go into incredible depth about certain pieces of technology and the problems they’ve had to solve and overcome. It’s priceless to read about lessons learned from people trying to solve the impossible with the best tools available. A good example and one of my favorite posts is On Spark, Hive, and Small Files: An In-Depth Look at Spark Partitioning Strategies.
3. Jesse Anderson
https://www.jesse-anderson.com/category/blog/
Sometimes data engineering is about the bigger picture too. Jesse is someone I’ve followed for a while for his good content centered around data engineering, especially the big picture stuff. Being a good data engineer means writing good code, but it’s usually more than that too. Check it out.
4. LinkedIn Engineering
https://engineering.linkedin.com/blog
The LinkedIn Engineering blog is another great one, and has a surprising amount of data engineering and general data content. Again, they cover a lot of high-level architecture and tech stacks, and it’s a great way to keep up on the design patterns out in the wild. For example, a great article is Coral: A SQL translation, analysis, and rewrite engine for modern data lakehouses. It gives you a great understanding of how some enterprise petabyte-scale data warehouses are run.
5. Uber Engineering
Yet another monolith engineering blog. Yes, of course Uber has to be added to the list. I love this blog because it’s an awesome mix of high-level topics like Revolutionizing Money Movements at Scale with Strong Data Consistency, all the way to in-depth tech like Building a Large-scale Transactional Data Lake at Uber Using Apache Hudi. There is something here for everyone, and it’s great to hear how some “less” popular tech stacks get used at scale.
6. Databricks Engineering
https://databricks.com/blog/category/engineering
This is a new one on the scene. Databricks has taken the Big Data world by storm lately, it seems. Their engineering blog is no different and is full of nuggets of lessons for the savvy data engineer. From super helpful topics like How to Manage Python Dependencies in PySpark to super interesting ones like How to Train XGBoost With Spark, you would be crazy not to dig through this blog and find something to learn.
7. Andreas Kretz
https://www.teamdatascience.com/blog
Here is another good one! Andreas is the juggernaut of data science and data engineering content and learning on LinkedIn and other platforms. There is a lot of content about getting into DS and data engineering, career advice, etc.
8. Oracle Blog (Data Science)
https://blogs.oracle.com/datascience/
This one is a hidden gem. I know it says data science, but there is a ton of great content for data engineers, especially those working around Machine Learning. For example, a great post is A Simple Guide to Leveraging Parallelization for Machine Learning Tasks. You have to dig a little bit to get past the articles that are more about Oracle, but you can find good ones when you look.
9. Cloudera Blog (Data Engineering)
https://blog.cloudera.com/product/data-engineering/
People have been talking about the death of Hadoop for a while, even though things like Hive are still going strong. The Cloudera blog has some awesome content, including some amazing in-depth topics like How does Apache Spark 3.0 increase the performance of your SQL workloads. Again, just like Oracle, you have to sift through the marketing content, but once you do, there are some really smart people writing up great ideas. It’s worth the dig.
10. Yelp Engineering Blog
https://engineeringblog.yelp.com/
Here is another general company engineering blog that is a gold mine. I mean, where else can you read about Orchestrating Cassandra on Kubernetes with Operators or Migrating Kafka’s Zookeeper With No Downtime? There is a steady supply of good reads just waiting for you.
Nice list! I also recommend the newsletter “Data Engineering Weekly”: https://www.dataengineeringweekly.com/
It’s a curated list of the latest posts and updates in the Data Engineering landscape 🙂