Good ole’ string slicing. That’s one thing that never changes in Data Engineering: working with strings. You would think we would all get to grow up some day and do the complicated stuff, but apparently you can’t outrun your past. I blame this mostly on the data and old-school companies. Plain text and flat files are still incredibly popular and common for storing and exporting data between systems. Hence string work comes upon us all like some terrible overlord. The one you should fear the most is fixed-width delimited files. I ran into a problem recently where PySpark was surprisingly terrible at processing fixed-width delimited files and “string slicing.” It got me wondering … is it me or you?
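For anyone who hasn’t had the pleasure, here is a rough sketch of what “string slicing” a fixed-width record looks like in plain Python. The column layout (`FIELDS`) and the sample record are made up for illustration; they are not from any real file or from the PySpark job mentioned above.

```python
# Hypothetical layout: (column name, start offset, end offset).
# These offsets are assumptions for illustration only.
FIELDS = [("id", 0, 5), ("name", 5, 15), ("amount", 15, 23)]


def parse_fixed_width(line, fields=FIELDS):
    """Slice one fixed-width record into a dict, stripping the padding."""
    return {name: line[start:end].strip() for name, start, end in fields}


# A made-up 23-character record: 5 chars of id, 10 of name, 8 of amount.
record = parse_fixed_width("00042Jane Doe  00012.50")
```

Every value comes back as a string, so any type conversion would still be a separate step.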

Read more

One of the recurring complaints you always see being parroted by the smarter-than-anyone-else-on-the-internet Reddit lurkers is the slowness of Python. I mean I understand the complaint …. but I don’t understand the complaint. Python is what it is, and usually is the best at what it is, hence its ubiquitous nature. I’ve been dabbling with Scala for a while, much to my chagrin, and have been wondering about its approach to concurrency. I’ve used MultiProcessing and MultiThreading in Python to supercharge a lot of tasks over the years, and I want to see how easy or complex this would be in Scala, although I don’t think easy and Scala belong in the same sentence.
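For reference, the Python side of that comparison can be as small as the sketch below. It uses the standard library’s `concurrent.futures` thread pool; the `line_length` function and the sample input are placeholders I made up, standing in for whatever real per-item work you have.

```python
from concurrent.futures import ThreadPoolExecutor


def line_length(line):
    # Placeholder for real per-item work (parsing, an HTTP call, etc.).
    return len(line)


def process_all(lines, workers=4):
    """Run line_length over every item using a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input.
        return list(pool.map(line_length, lines))


lengths = process_all(["a", "bb", "ccc"])
```

Threads suit I/O-bound work; for CPU-bound work, `ProcessPoolExecutor` offers the same `map` interface, though it wants the pool created under an `if __name__ == "__main__":` guard.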

Read more

With parquet taking over the big data world, as it should, and csv files being that third wheel that just will never go away…. it’s becoming more and more common to see the repetitive task of converting csv files into parquets. There are lots of reasons to do this: compression, fast reads, integrations with tools like Spark, better schema handling, and the list goes on. Today I want to see how many ways I can figure out how to simply convert an existing csv file to a parquet with Python, Scala, and whatever else there is out there, and how simple and fast each option is. Let’s do it!

Read more

Trying to learn Scala drives me crazy.

I seriously don’t know why I keep doing this to myself. I know learning new things is something I need to do, but why Scala? I’m perfectly happy writing Python all day long. It’s straightforward and concise, no boilerplate, no re-inventing the wheel. I’ve written pipelines that crunch hundreds of TBs of data in Python, so all the snotty people who complain about Python not being fast enough or whatever can go hang out with this cow, looks like he could use a friend. This is something I’ve been meaning to do for a while: use Scala to read some text file(s), and store the data somewhere with some client. I chose ElasticSearch. I really just wanted practice doing something simple like reading files and I was curious about how good the Scala clients are for popular tools.

Read more

Man. Every time I open IntelliJ to write/learn some more Scala I have to take a deep breath. Yes, it’s been fun and good for my brain, which feels like it’s in a Doctor Strange movie, but it’s also challenging and frustrating at times. One of the things I find myself doing a lot as a Data Engineer is HTTP stuff, mostly pulling files or data from APIs. Doing this work in Python is most enjoyable and easy, so I’ve been curious to see how Scala handles HTTP stuffy stuff.

Read more

In Part 1 of my laborious journey from Python to Scala, I did some work with file operations, CSV files, and messing with the data. It took me a little longer than I expected to wrap my head around the Scala functional/object/immutable approach to software design. But, in the end it felt satisfying and I’m starting to be a convert. Scala makes you think a little harder than Python, is less forgiving, and requires more of you as the developer. In part deux, I figured the next topic to grapple with would be some simple retrieval of remote files and writing those files to disk. Also, I wanted to take a crack at Classes in Scala.

Read more

UPDATE: If you want to know how my Scala SHOULD have been written, check out this link!

I feel like a frontiersman heading west, into the unknown. I’ve been successful using Python as a Data Engineer for some time, processing terabytes of data with what “real” programmers sneer at as barely even a real language. Whatever. But, some of my favorite tools, like Spark, are written in Scala, and it’s on the rise, so I should probably join the lemmings in their mad dash. If for no other reason than to expand my horizons.

Read more