DataFrames vs SparkSQL – Which One Should You Choose?
I’ve been amazed at the growth of Spark over the last few years. I remember when I first started writing about Spark here and there, about 5 years ago, it was popular, but still not used that widely at smaller companies. AWS Glue was just starting to get popular, and it seemed the barrier to wider Spark adoption was managing Spark clusters and the like. That has all changed over the last few years with EMR, Databricks, and the like.
Back in those days, it was common for most Spark pipelines to be written with the DataFrame API; you didn’t see much SparkSQL around. I’m going to talk about how that has changed, what you should be using, and why.
Spark DataFrame vs SparkSQL
You really have two options when writing Spark pipelines these days …
Honestly, I think they could not be more different. People are probably going to fall into one of the two camps, and there are a lot of reasons for that.
# SparkSQL
df.createOrReplaceTempView("my_table")
results = spark.sql("select * from my_table")

# DataFrames
df = spark.read.parquet('s3a://my-bucket/my-data-source/')
And it only gets more different from there.
The Big Picture
What are we doing when we write Spark pipelines? We are writing ETL or transforms … we are manipulating data. SparkSQL and DataFrames are basically your two options for doing those transformations. SparkSQL is just that, SQL-based transformations; the DataFrame option is more methods and calls, more familiar to programmers.
You can get the same result using both SparkSQL and DataFrames; in fact, most of the built-in Spark functionality mirrors each other between the two. So does it really matter what you choose if that’s the case?
I would say yes. Most likely your pipelines will end up looking completely different depending on which option you choose, unless you are an extremely disciplined person or data engineering team.
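To make that “mirroring” concrete, here is a minimal sketch of the same filter written both ways (the toy data and column names are mine, not from any real pipeline):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical toy data, just to show the two styles side by side
orders = spark.createDataFrame(
    [(1, 10.0), (7, 25.0)],
    ['product_id', 'order_amount']
)

# DataFrame API
df_result = orders.filter(F.col('order_amount') > 15)

# SparkSQL -- same result, different syntax
orders.createOrReplaceTempView('orders')
sql_result = spark.sql("SELECT * FROM orders WHERE order_amount > 15")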
DataFrame Pipeline Example
What does a typical DataFrame pipeline in Spark look like?
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

def read_sales_data(uri: str = 's3a://sales-data/customer-orders/2021/*') -> DataFrame:
    df = spark.read.parquet(uri)
    return df

def define_product(input_df: DataFrame) -> DataFrame:
    output_df = input_df.withColumn(
        'product',
        F.when(F.col('product_id').isin([1, 2, 3, 4]), F.lit('product_one'))
         .otherwise(F.lit('product_two'))
    )
    return output_df

def agg_sales_by_product(input_df: DataFrame, gb: str = 'product', ag: str = 'order_amount') -> DataFrame:
    output_df = input_df.groupBy(gb).agg(F.sum(F.col(ag)).alias('sales'))
    return output_df

df = read_sales_data()
products = define_product(df)
metrics = agg_sales_by_product(products)
What do we notice about this type of PySpark code using DataFrames? Well …
- it’s modular (aka reusable)
- it’s easily unit testable (see the test sketch after this list)
- easy to read
- logic is broken up and encapsulated
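For example, here is a rough pytest-style sketch of what a unit test for define_product could look like (the local SparkSession setup and test data are my own assumptions, not part of the original pipeline):

from pyspark.sql import SparkSession

def test_define_product():
    # assumes define_product is importable from your pipeline module
    spark = SparkSession.builder.master('local[1]').getOrCreate()
    input_df = spark.createDataFrame(
        [(1, 100.0), (9, 50.0)],
        ['product_id', 'order_amount']
    )

    output_df = define_product(input_df)

    result = {row['product_id']: row['product'] for row in output_df.collect()}
    assert result[1] == 'product_one'   # id in (1, 2, 3, 4)
    assert result[9] == 'product_two'   # everything else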
SparkSQL Pipeline Example
How might this look with SparkSQL?
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

def read_sales_data(uri: str = 's3a://sales-data/customer-orders/2021/*') -> DataFrame:
    df = spark.read.parquet(uri)
    return df

def run_sql(input_df: DataFrame) -> DataFrame:
    input_df.createOrReplaceTempView('tmp_sales')
    df = spark.sql("""
        SELECT
            CASE WHEN product_id IN (1, 2, 3, 4) THEN 'product_one' ELSE 'product_two' END AS product,
            SUM(order_amount) AS sales
        FROM tmp_sales
        GROUP BY CASE WHEN product_id IN (1, 2, 3, 4) THEN 'product_one' ELSE 'product_two' END
    """)
    return df

df = read_sales_data()
metrics = run_sql(df)
What do we notice about this code?
- it’s compact
- all the logic is together
- unit testing would be less trivial but not impossible (a rough sketch follows this list)
- easy to read
- probably not easily reusable
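You can still unit test the SQL version, it just takes a little more setup, and you can only assert on the output of the whole query rather than on each individual step. A rough sketch, again with hypothetical test data:

from pyspark.sql import SparkSession

def test_run_sql():
    # assumes run_sql is importable; the whole query is tested as one unit
    spark = SparkSession.builder.master('local[1]').getOrCreate()
    input_df = spark.createDataFrame(
        [(1, 100.0), (9, 50.0)],
        ['product_id', 'order_amount']
    )

    metrics = run_sql(input_df)

    result = {row['product']: row['sales'] for row in metrics.collect()}
    assert result['product_one'] == 100.0
    assert result['product_two'] == 50.0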
SparkSQL vs DataFrames, does it really matter?
My guess is that, as a data engineer, you probably felt drawn to or related more to one of the two sets of code above than the other. Why?
Probably habit.
Some of us come from more of a software engineering background; maybe you like the DataFrame stuff a little more, it feels more like code because it is. Maybe you came from a classic database, SQL developer, or analytics background; you probably love the ease and familiarity of writing SQL … and with the power of Spark! Hard to beat.
Learning Important Lessons from Each Other
I think the difference between SparkSQL and DataFrames in pipelines is probably more theoretical and emotional than anything else. We all probably just feel more comfortable with one or the other … but we shouldn’t let that put us in a box. It’s easy to get comfortable, and that can be a path that doesn’t lead to growth and learning.
What can DataFrame folks learn about using SparkSQL?
- Sometimes complex transformations are easier to follow and wrap your mind around in SQL
- There are wonderful features like CTEs in SparkSQL that can make your life easier (see the sketch after this list)
- Some transformations just might be easier and take less code in SparkSQL!
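For instance, a CTE lets you name an intermediate step inside the query, which also avoids repeating that CASE expression in the GROUP BY from the earlier example. A rough sketch, assuming the same tmp_sales temp view is registered:

# assumes the tmp_sales temp view from the SparkSQL example is registered
result = spark.sql("""
    WITH product_sales AS (
        SELECT
            CASE WHEN product_id IN (1, 2, 3, 4) THEN 'product_one'
                 ELSE 'product_two' END AS product,
            order_amount
        FROM tmp_sales
    )
    SELECT product, SUM(order_amount) AS sales
    FROM product_sales
    GROUP BY product
""")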
What can SparkSQL folks learn from using DataFrames?
- the ease and importance of unit testing
- making code more modular and reusable
- learning to write better code and becoming comfortable programming, instead of always defaulting to SQL
Musings
I’ve seen the good and the bad of SparkSQL taken too far, and of DataFrames taken too far. Like most things in life, it’s about using the correct tool at the correct time. We are looking for balance in all things.
Sometimes certain operations like UPDATEs, DELETEs, or CTEs simply make more sense and make for a more legible and workable pipeline. Other times breaking logic up into reusable units that are easily tested can leave a large repository with less code overall and make it less brittle to breakage and bugs. I think mixing the two is probably the best approach, and switching it up from time to time to teach yourself something new is always a good idea.
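A mixed version of the earlier pipeline might look something like this, keeping the small, unit-testable DataFrame functions but letting SQL handle the aggregation where it reads cleanly (a sketch reusing the hypothetical functions and column names from above):

# DataFrame API for I/O and the small, unit-testable transform
df = read_sales_data()
products = define_product(df)

# switch to SparkSQL for the aggregation
products.createOrReplaceTempView('products')
metrics = spark.sql("""
    SELECT product, SUM(order_amount) AS sales
    FROM products
    GROUP BY product
""")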