Pandas DataFrame.to_sql() …. { how you should configure it to not be that guy. }
I never understand it when someone comes up with a great tool, then ships defaults that make it work poorly… leaving the rest up to the imagination. The Pandas dataframe has a great and underutilized tool… to_sql(). Lesson learned, always read the fine print I guess. I'm usually guilty of this myself… wondering why something is slow and sucks… and not taking the time to read the documentation. Here are some musings on using to_sql() in Pandas and how you should configure it so you don't pull your hair out.
Setting up to test Pandas.DataFrame.to_sql()
So what I want to do is test the effect some of the configuration values of Pandas.DataFrame.to_sql() have on performance with reasonably sized datasets. Firstly, I don't use Pandas that much myself. I've found it nearly impossible to use for big data; it's a memory hog and only useful on what I would consider small datasets that fit in the memory footprint of a not-high-mem instance. True Believers use Spark DataFrames instead. Either way, all you raving Pandas fans can keep your comments to yourselves. I digress.
I'm going to use a 2,438,457-row dataframe built from the free Divvy Bike Trip datasets, and then try dumping that dataframe straight to a Postgres table. This isn't that unusual of a data flow. First I'm going to run just a basic Docker Postgres instance.
FYI. Code is available on GitHub.
Run local Postgres
docker run --name postgres -e POSTGRES_PASSWORD=postgres -d -p 5432:5432 postgres
Now you will just have the base Docker postgres image running locally at localhost or 0.0.0.0, listening on port 5432. Default user is postgres, and the password as well.
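If you want to sanity-check that connection before throwing 2.4 million rows at it, a couple of lines with sqlalchemy will do it. Just a sketch, assuming the container above is up and you have a Postgres driver installed (more on that below).

from sqlalchemy import create_engine, text

# Quick sanity check that the Dockerized Postgres is reachable
# with the default postgres/postgres credentials on port 5432.
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
with engine.connect() as conn:
    print(conn.execute(text("SELECT version();")).scalar())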
Mash data together.
Next I downloaded 5 csv files from 2020 using the above Divvy link.
202006-divvy-tripdata.csv 202007-divvy-tripdata.csv 202008-divvy-tripdata.csv 202009-divvy-tripdata.csv 202010-divvy-tripdata.csv
There are about 2.4 million rows so this should be a good test.
from glob import glob

import pandas as pd


def gather_file_names() -> list:
    # Grab every Divvy trip csv sitting in the working directory.
    file_names = glob("*divvy*.csv")
    return file_names


def combine_files(incoming_files: list) -> pd.DataFrame:
    # Read each csv and stack them all into one big dataframe.
    blah = pd.concat(
        [
            pd.read_csv(csv_file, index_col=None, header=0)
            for csv_file in incoming_files
        ],
        axis=0,
        ignore_index=True,
    )
    return blah


def main():
    my_lovely_frame = combine_files(incoming_files=gather_file_names())
    print(len(my_lovely_frame.index))


if __name__ == "__main__":
    main()
Testing out Pandas.DataFrame.to_sql()
So now let's test out the "out of the box" dataframe-to-SQL function of Pandas. The to_sql function takes a sqlalchemy engine to connect to the database, in our case a local Postgres running via Docker. (You will need to pip install psycopg2 as well… you will have the best luck on macOS with pip3 install psycopg2-binary.)
Let’s add a few lines of code to create the sqlalchemy engine and call the to_sql
method without messing with any of the parameters.
from datetime import datetime

from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")

t1 = datetime.now()
print("starting to insert records.")
my_lovely_frame.to_sql(con=engine, name="trip_data", if_exists="replace")
t2 = datetime.now()
x = t2 - t1
print(f"finished inserting records... it took {x}")
Yikes. 47 minutes, huh? Bahaha!!
starting to insert records.
finished inserting records... it took 0:47:14.091238
Time to play, let's try setting chunksize.
my_lovely_frame.to_sql(
con=engine, name="trip_data", if_exists="replace", chunksize=50000
)
Hmm…
starting to insert records.
finished inserting records... it took 0:46:55.817583
Wow… that was useless. Let's try the multi method (multiple inserts per statement).
my_lovely_frame.to_sql(
con=engine, name="trip_data", if_exists="replace", chunksize=50000, method="multi"
)
Well, that was an improvement, no surprise there. Not doing an insert statement for every single row seems like a no-brainer… (if you want to see the difference in the generated SQL for yourself, there's a quick trick right after the timing below).
starting to insert records.
finished inserting records... it took 0:09:15.084440
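Here's that quick trick, a sketch only: it assumes the my_lovely_frame from earlier, and trip_data_peek is just a throwaway table name I made up. Flip on sqlalchemy's statement logging with echo=True and push a handful of rows, and you can compare the shape of the INSERTs each mode generates.

from sqlalchemy import create_engine

# echo=True makes sqlalchemy log every statement it generates.
peek_engine = create_engine(
    "postgresql://postgres:postgres@localhost:5432/postgres", echo=True
)

# Push a handful of rows once with the defaults and once with method="multi",
# then compare the logged INSERT statements: a single-row VALUES clause
# by default vs. one long multi-values list per chunk with "multi".
my_lovely_frame.head(10).to_sql(
    con=peek_engine, name="trip_data_peek", if_exists="replace"
)
my_lovely_frame.head(10).to_sql(
    con=peek_engine, name="trip_data_peek", if_exists="replace", method="multi"
)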
Moral Of Story
Configurations and documentation matter. From 47 minutes to 9 minutes is a bit of a jump for just adding method="multi", wouldn't you say?
Just Because I Can…
I was curious about the fastest way with Python to get 2.4 million rows into the database without using COPY.
Multiprocessing? Let’s add a new function.
from concurrent.futures import ProcessPoolExecutor
from random import randint


def pandas_smandas(file_uri: str) -> None:
    # Each worker process reads one csv and creates its own engine,
    # since connection pools shouldn't be shared across processes.
    my_lovely_frame = pd.read_csv(file_uri)
    engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")
    print("starting to insert records.")
    # Random suffix so the workers write to separate tables
    # instead of replacing each other's.
    say_what = randint(1, 100)
    my_lovely_frame.to_sql(
        con=engine,
        name=f"trip_data_{say_what}",
        if_exists="replace",
        chunksize=50000,
        method="multi",
    )


def main():
    t1 = datetime.now()
    # One worker per csv file: five files, five processes.
    with ProcessPoolExecutor(max_workers=5) as PoolParty:
        PoolParty.map(pandas_smandas, gather_file_names())
    t2 = datetime.now()
    x = t2 - t1
    print(f"finished inserting records... it took {x}")


if __name__ == "__main__":
    main()
Well of course that was much faster.
finished inserting records... it took 0:02:40.825578
Got any other ideas to make it faster? Leave a comment below!
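One idea, if you're willing to bend the "no COPY" rule from above: to_sql also accepts a callable for method=, and the pandas docs show an insertion method built on Postgres COPY. Here's a sketch along those lines… psql_insert_copy is just a name I picked, and it assumes psycopg2 is the driver underneath.

import csv
from io import StringIO


def psql_insert_copy(table, conn, keys, data_iter):
    # Custom insertion method for DataFrame.to_sql(), modeled on the
    # example in the pandas docs. Streams each chunk through Postgres
    # COPY ... FROM STDIN instead of INSERT statements.
    dbapi_conn = conn.connection  # raw psycopg2 connection
    with dbapi_conn.cursor() as cur:
        buffer = StringIO()
        writer = csv.writer(buffer)
        writer.writerows(data_iter)
        buffer.seek(0)

        columns = ", ".join(f'"{k}"' for k in keys)
        table_name = f"{table.schema}.{table.name}" if table.schema else table.name

        cur.copy_expert(
            sql=f"COPY {table_name} ({columns}) FROM STDIN WITH CSV",
            file=buffer,
        )


# Same to_sql call as before, same engine, just a different method.
my_lovely_frame.to_sql(
    con=engine,
    name="trip_data",
    if_exists="replace",
    chunksize=50000,
    method=psql_insert_copy,
)

That keeps the convenience of to_sql while handing the actual data transfer over to COPY.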
Why not just use copy via psycopg2?
Because Pandas.DataFrame.to_sql() can be used for other RDBMS besides Postgres. Also, the to_sql function is convenient and widely used because of Pandas, so it's good to know how to use it properly when needed. Thanks for reading the blog.