Home - Confessions of a Data Guy

Big Data, Data, Data Engineering, Python

Data Engineering/Data Pipeline repo Project Template (free).

Probably one of the hardest hurdles to jump over when starting out in anything new, including Data Engineering and Data Pipelines, is knowing where to start. It can always be a little daunting. One aspect that can make or break any project, giving you the confidence to move forward like Spartacus to conquer, is having […]

May 8, 2022

Big Data, Data, Data Engineering, Python, Ramblings

Review of Prefect for Data Engineers

Not going to lie, I do enjoy the vendor wars that this marketing craze called “The Modern Data Stack” has created. I like to keep just about everything in life at arm’s length. Kinda like the way you look at your crazy third cousin out of the corner of your eye at the family reunion. […]

April 30, 2022

Big Data, Data, Data Engineering, Data Warehousing, Python

Reducing Complexity with Databricks + Delta Lake COPY INTO

As the years drag by in Data Engineering, there are a few things that I have come to appreciate more and more. One of those topics that is close to number one on the list is complexity reduction. Today’s modern data stacks are filled to the brim with technologies and tools, full to the brim, […]

April 14, 2022

Big Data, Data, Data Engineering

Data Pipelines 101 – The Basics.

I’ve been getting a lot of questions lately about data pipelines: how to design them, what to think about, and what patterns to follow. I get it. If you’re new to Data Engineering, it can be hard to know what you don’t know. There is a lot of content specific to certain technologies, but not […]

April 6, 2022

Big Data, Data, Data Engineering, Golang, Ramblings

Golang – Useful for everyday Data Engineering?

I periodically try to pick up a new programming language on my journey through Data Engineering life. There are many reasons to do that: personal growth, boredom, seeing what others like, and helping me think differently about my code. Golang has been on my list for at least a year. I don’t hear much about […]

March 17, 2022

Big Data, Data, Data Engineering, Data Quality, Data Warehousing, Python

Data Quality – Great Expectations for Data Engineers

Mmmm … Data Quality … it is a thing these days. I look forlornly back to the ancient days of SQL Server when nobody cared about such things. Alas, we live in a different world, where hundreds of terabytes of data are the norm, and Data Quality becomes a thing. I’ve been meaning to give […]

March 17, 2022

Big Data, Data, Data Engineering, Python

Databricks Access Control – The 3 Most Important Steps

It’s not often I yearn for the good old days of SQL Server, but I’ve had a few of those moments lately. Some things I miss, some I don’t, and it’s probably because I’m getting old and crusty, stuck in my ways, but permissioning is one of those topics where I think about the good […]

March 3, 2022

Data, Data Engineering, Python, Ramblings

boto3 or aws cli for s3 … Python vs Bash and other thoughts.

For any Data Engineer working on aws for any length of time, there is one task that always seems to come up and never go away. Manipulating files in an s3 bucket on aws is something I’ve had to do for years; it just never goes away. It’s always something … listing files, moving files, […]

February 28, 2022

Big Data, Data, Data Engineering, Data Warehousing

Part 4 – Keys To Success – Idempotency and Partitioning.

As the road winds on, we come to Part 4 of our 5 Part Series on Data Warehouses, Lakes, and Lake Houses. Finally, we are getting to some fun topics after all the boring stuff. Today I want to talk about the two keys to success in your Data Lakes … Idempotency and Partitioning. I […]

February 9, 2022

Big Data, Data, Data Engineering, Data Warehousing

Databricks + Delta Lake MERGE duplicates – Deterministic vs Non-Deterministic ETL.

Is there any problem more classic to Data Lakes and Data Warehouses than duplicate records? You would think after doing the same ETL for over a decade I could avoid the issue. Apparently not. It’s good never to think too highly of oneself; the duplicates can get us all. Today I want to […]

February 3, 2022
