Have you ever wondered about being explicit in your code vs being vague? I think about this a lot as I’m writing code on a daily basis. I’ve found I like being explicit and verbose when writing code, rather than being vague in what I’m doing most of the time.
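To make the explicit vs vague idea a little more concrete, here is a minimal sketch in Python. Everything in it (the `Order` class, `total_vague`, `total_amount_for_customer`) is a hypothetical name made up for illustration, not something from a real codebase. Both functions compute the same total; the difference is how much a reader has to guess.

```python
from dataclasses import dataclass


# Vague version (hypothetical): rows are bare tuples, fields are picked by
# magic index, and passing no filter silently sums everything.
def total_vague(rows, t=None):
    return sum(r[1] for r in rows if not t or r[0] == t)


# Explicit version (hypothetical): named fields instead of tuple positions.
@dataclass(frozen=True)
class Order:
    customer: str
    amount_usd: float


def total_amount_for_customer(orders: list[Order], customer: str) -> float:
    # Spell out the filter step and the sum step separately.
    matching = [order for order in orders if order.customer == customer]
    return sum(order.amount_usd for order in matching)


orders = [Order("a", 10.0), Order("b", 5.0), Order("a", 2.5)]
total_for_a = total_amount_for_customer(orders, "a")
```

The explicit version is longer, but the types, names, and required `customer` argument say what the function does without reading its body.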
What is going on? Is the world coming to an end? I thought Python was going to live forever. Well, apparently not at Google. Recently Google announced it was laying off its entire North American-based Python team that was supporting Google’s special needs with Python, in favor of cheaper offshore workers.
Apparently, some of these Engineers were GOAT-level employees.
Is that really that much of a surprise? Probably not. Ever since Bard, Google has been failing hard. They still make a ton of money but old Google has gone bye-bye. It’s a new Era of Corporate Google, and it’s here to stay.
Heck, you can still make bank working in Software at Google … but you better save some of those greens for when your name gets called.
I was recently confronted with an interesting conundrum when writing a complex data pipeline. It was an interesting problem that arose from my quest to reduce complexity in part of the design, which found itself creeping into another part, reinforcing the classic question of whether you can really make the complexity pea go away, or if you simply shuffle the pea somewhere else to hide it.
It’s been a while since I wrote about Polars on this blog; I’ve been remiss. Some time ago I wrote a very simple comparison of switching from Pandas to Polars. I didn’t put much real effort into it, yet it was popular, so this is my attempt to expand on that topic a little.
Recently, while lying flat on my back on my sunporch soaking up the vitamin D beating down on me, dreaming about code, which I always do, it struck me.
Have you ever wondered at a high level what it’s like to build production-level data pipelines on Databricks? What does it look like, what tools do you use?
Ever wondered how to build an end-to-end project for an Open Source Python Package that gets published to PyPI? I built out lakescum, an open-source package to help with querying Databricks Unity Catalog Delta Lake tables with Polars, DuckDB, or PyArrow. https://github.com/danielbeach/lakescum
One thing all Data Engineers are doomed to do in purgatory will be to solve different date and datetime problems in an endless loop. I’m sure of it. I can’t imagine anything worse, so that must be it. Either way, the constant need to manipulate dates and datetimes is just a way of life, something that never ends and never changes. Also, it appears Polars is here to stay from what I can tell. Not a fad like that Data Mesh. Since Polars is here to stay (I’ve already got it running in production at my company, don’t mind if I bow), we should probably take a gander at how to manipulate date and datetime objects from both the DataFrame and (if I have time) SQL perspective. See if we can find anything to complain about. I like to complain.
Is there anything more Chad than Apache Airflow … and Rust? I think not, you wimp. What two things do I love most? At the moment Rust and Airflow are at least somewhere at the top of that list. I wring my hands sometimes, wishing that things and technologies would somehow come together into some bubbling soup and witch’s concoction from the depths. Then I had a strange thought while lying in bed one night.
What would happen if I ran my Rust inside my Apache Airflow? What would happen? Would the sun go dark? Would SQL Servers everywhere puke up their log files and go to Davy Jones’s locker? Birds fall from the sky? Why hasn’t anyone done this before, why isn’t anyone making this happen in real life?
I always leave it to my dear readers and followers to give me pokes in the right direction. Nothing like the teeming masses to set you straight. Recently I was working on my Substack Newsletter, on the topic of Polars + Delta Lake, reading remote files from S3 … I left a question open on my LinkedIn account.
I had someone jog my leaky memory in favor of DuckDB. I haven’t touched DuckDB in some time, and I’m sure it’s under heavy development, what with that MotherDuck and all.
So, it’s time to talk about DuckDB + Delta Lake.
In the vast world of data, it’s not just about gathering and analyzing information anymore; it’s also about ensuring that data pipelines, processes, and platforms run seamlessly and efficiently. Nothing screams “we are flying by night” like coming into a Data Team only to find no tests, no docs, no deployments, no Docker, no nothing. Just a mess and tangle of code and outdated processes, with no real way to understand how to get code from dev to production … without taking down the system.
This is where the principles of DevOps and Continuous Integration/Continuous Deployment (CI/CD) come into play, especially in the realm of data engineering. Let’s dive into the importance of these practices and how they’ve become indispensable in modern data engineering workflows.