I’ve been amazed at the growth of Spark over the last few years. I remember 5 years when I first started writing about Spark here and there, it was popular, but still not used that widely at smaller companies. AWS Glue was just starting to get popular, it seemed the barrier to widely adopted Spark was the managing of Spark clusters etc. That has all changed the last few years with EMR, Databricks, and the like.
Back in those days, it was common for most Spark pipelines to be written with the DataFrame API, you didn’t see much SparkSQL around. I’m going to talk about how that has changed, what you should be using, and why.
Read more