Data Engineering Archives - Page 21 of 23 - Confessions of a Data Guy

My Journey from Python to Scala – Part 1

UPDATE: If you want to know how my Scala SHOULD have been written, check out this link!

I feel like a frontiersman heading west, into the unknown. I’ve been successful using Python as a Data Engineer for some time, processing terabytes of data with what “real” programmers sneer at as barely even a real language. Whatever. But some of my favorite tools, like Spark, are written in Scala, and it’s on the rise, so I should probably join the lemmings in their mad dash. If for no other reason than to expand my horizons.

May 6, 2020

Data, Data Engineering, Python, Uncategorized

Big Data File Showdown – Avro vs Parquet with Python.

There comes a point in the life of every data person when we have to graduate from CSV files. At a certain point the data becomes big enough, or we hear talk on the street about other file formats. Apache Parquet and Apache Avro are two of those formats that have been coming up more with the rise of distributed data processing engines like Spark.

April 5, 2020

Data, Data Engineering, Machine Learning, Python

Challenges of Machine Learning Pipelines at Scale… When You Don’t Work at Google.

Complexity is in the eye of the beholder.

Building Machine Learning (ML) pipelines with big data is hard enough, and it doesn’t take much of a curve ball to make it a nightmare. Most of what you will read online is tutorials on how to take a few CSV files and run them through some sklearn package. If you are lucky, you might find some “big data” ML stories on Medium where someone uses Spark to crunch a bunch of JSON, Parquet, or CSV files at a scale of 10 to a few hundred gigabytes of data. Usually they are simplistic and ambiguous. Unfortunately, that isn’t how it works in the real world.

March 14, 2020

Data, Data Engineering, Python, Uncategorized

Apache Airflow for Data Engineers

On again, off again. I feel like that is the best way to describe Apache Airflow. It started out around 2014 at Airbnb and has been steadily gaining traction and usage ever since, albeit slowly. I still believe that Airflow is very underutilized in the data engineering community as a whole. Most everyone has heard of it, but its usage seems to be sporadic at best. I’m going to talk about what makes Apache Airflow the perfect tool for any Data Engineer, and show you how you can use it to great effect while not committing to it completely.

January 11, 2020

Data Engineering, Python, SQL

Introduction to Postgres with Python

Python and Postgres, a match made in heaven.

If there was ever a match made in heaven, it’s using Python and Postgres together. They were made for each other. Both are fun, easy to use, and addicting, and both have so many surprises and hidden gems. Like Gandalf and Frodo, the two just go together. Today I want to go through the basics of interacting with Postgres using Python. In the beginning of my data career this was often a point of pain, even though it seems like it should be easy. Let’s hit on the basics and then a few of the not-so-obvious things I wish I had known in the beginning.

December 30, 2019

Data, Data Engineering, Python

Exploring ElasticSearch with Python

What’s Elasticsearch, precious? I feel like Gollum when confronted by taters. Elasticsearch has been around for a while now. Based on Lucene, it’s become a well-known name in the field of text and semi-structured data storage, analysis, and retrieval. Even though it’s popular enough to get name recognition, I’ve rarely run across it in the wild. We are going to dip our toes into Elasticsearch by working on a small project to store and search a book (or books). It gives us just enough simple problems to solve that by the end we should have at least a basic understanding of how to connect, store, and retrieve simple documents with Elasticsearch.

December 17, 2019

Data, Data Engineering, Python

3 (Or More) Ways to Open a CSV in Python

Ah. What a classic. The one piece of code that I end up writing over and over again; you would think I would have stashed it away by now. Not going to lie, I usually have to Google it, while thinking, is this the right way? Should I just open the CSV file and iterate it? Should I import the csv module? Should I just use Pandas? Does it matter? Probably not.
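For a rough idea of what those options look like, here is a minimal sketch of the first two using only the standard library. The sample data and the function names are made up for illustration and are not taken from the post; the Pandas route would be `pandas.read_csv`, left out here to avoid a third-party dependency.

```python
import csv
import io

# Hypothetical sample data standing in for a real CSV file on disk.
SAMPLE = "id,name\n1,Frodo\n2,Gandalf\n"


def read_by_iterating(handle):
    """Option 1: iterate the file and split each line on commas.

    Simple, but it breaks as soon as a field contains a quoted comma.
    """
    return [line.rstrip("\n").split(",") for line in handle]


def read_with_csv_module(handle):
    """Option 2: let the csv module handle quoting and delimiters."""
    return list(csv.reader(handle))


# io.StringIO behaves like an open text file, so no file needs to exist.
rows_plain = read_by_iterating(io.StringIO(SAMPLE))
rows_csv = read_with_csv_module(io.StringIO(SAMPLE))
```

On clean data like this, both approaches return the same rows. They only diverge once quoted fields show up, which is usually the reason to reach for the csv module.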

November 27, 2019

Data, Data Engineering, Geospatial, Python

Thunderdome for Geospatial Tools in Python. It’s to the Death.

A fight to the death. A comparison of geo-spatial tools in Python. What’s easy and fast to use.

It’s a fight to the death, people… that’s why it’s called Thunderdome. This will be no different. Last time we talked about the very basics of the strange world of geo-spatial tools for data engineering. The next most obvious thing to do, of course, is to see what tool is the best. By best I mean what tools can be used to load and do simple manipulation of data in a fast and relatively simple manner.

November 23, 2019

Data, Data Engineering, Python

Python Async File Operations – Juice Worth the Squeeze?

Async file operations in Python, juice worth the squeeze?

What I’ve greatly feared has come to pass. I’ve come to love one of the most confusing parts of Python. AsyncIO. It has this incredible ability for data engineers building pipelines in Python to take out so much wasted IO time. It saves money. It’s faster. People think you’re smarter than you are. Tutorials are one thing, but implementing it in your complex code is typically mind-bending and a test of your patience and self-worth.

October 17, 2019

Data Engineering, Machine Learning

Data Engineering a Machine Learning Pipeline In Real Life.

The pipelines behind the machine learning.

It doesn’t take long reading articles on Medium or Towards Data Science to become enamored with Machine Learning. Especially the people and companies who do it in “production.” I always read about the big picture, the fancy algorithms, the cloud computing, but you have that feeling there is something missing. It’s all the details that are missing. Where is the force behind it all, bringing everything together? I like to think it’s called Data Engineering, with some Dev Ops for good measure.

October 1, 2019
