I’ve come to have a great love for PySpark, it’s such an easy and powerful tool to use. I use it every day to crunch tens to hundreds of terabytes of data, without even blinking an eye. And all this with the ease of Python, it’s almost too good to be true. I have to say though, where things get a little dicey is when you need to do something maybe “out-the-box”, say, strange text manipulations, something that is easy in Python becomes a challenge in PySpark using DataFrame API functionality.
Sure, you could use a udf
written in Python for that, but we all know the performance penalty for that. Many times I just try to get creative with a combination of PySpark functions to accomplish the same task others would use a udf
for.
I want to talk about two wonderful PySpark functions I find myself using a lot, they come in handy and I rarely see them used, hopefully, they come in handy for you!
Read more