DuckDB Out Of Memory – Has it been fixed?
Back in March, I did a writeup and experiment called DuckDB vs Polars, Thunderdome: the 16GB-on-a-4GB-machine challenge. The idea was to see if the two tools could process “larger than memory” datasets with lazy execution. Polars worked fine; DuckDB failed in spectacular fashion.
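For reference, the Polars side of that experiment leaned on lazy scans plus streaming collection. Here is a minimal sketch of that style of pipeline (local path and column names are illustrative only, not the exact script from the original post):

import polars as pl

# lazy scan: nothing is read into memory yet, just a query plan
lazy = (
    pl.scan_csv("harddrives/*.csv")
      .group_by("date")
      .agg(pl.col("failure").sum().alias("failures"))
)

# streaming collect processes the data in batches instead of loading it all at once
result = lazy.collect(streaming=True)
result.write_csv("results.csv")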
I also noted how many people had opened issues on GitHub about this very thing, but the issues were either ignored or closed. Someone on YouTube said some of these OOM issues were fixed in recent releases.
In an effort to keep DuckDB honest, I’m going to run another larger-than-memory data processing test with DuckDB to see if this issue has indeed been fixed. And lest you think larger-than-memory processing is not important, it’s this very limitation that has been the bane of Pandas for so long and caused tools like Polars to supplant it.
If you want to be a serious tool in today’s data landscape, you need to solve the problem of lazily processing larger-than-memory datasets; otherwise you will always be a hobbled second-class citizen.
Testing DuckDB on Ubuntu with 16GB of data on a 4GB machine.
So, we are going to test the latest version of DuckDB on an Ubuntu machine (4GB of RAM) with 16GB of Backblaze hard drive data.
aws s3 ls --profile confessions --summarize --human-readable --recursive s3://confessions-of-a-data-guy/harddrives
>> Total Objects: 183 >> Total Size: 15.9 GiB
Nothing new here.
Here is the same script we used last time.
import duckdb
from datetime import datetime


def main():
    t1 = datetime.now()
    # spill-to-disk location for anything that doesn't fit in memory
    duckdb.sql("SET temp_directory = 'temp';")
    duckdb.sql("""
        INSTALL httpfs;
        LOAD httpfs;
        SET s3_region='us-east-1';
        SET s3_access_key_id='key';
        SET s3_secret_access_key='secret';
        SET memory_limit = '3GB';
        CREATE VIEW metrics AS
        SELECT date, SUM(failure) AS failures
        FROM read_csv_auto('s3://confessions-of-a-data-guy/harddrives/*.csv',
                           ignore_errors = true, union_by_name = true)
        GROUP BY date;
    """)
    duckdb.sql("""
        COPY metrics TO 's3://confessions-of-a-data-guy/harddrives/results.csv';
    """)
    t2 = datetime.now()
    print(f"It took {t2 - t1} to run DuckDB")


if __name__ == '__main__':
    main()
Note I’m setting the memory limit to 3GB to help DuckDB avoid blowing past the machine’s memory, and explicitly setting the temp directory for it to spill to. Both are fixes commonly recommended online.
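If you want to double-check that those limits actually took effect before kicking off the big query, DuckDB exposes current_setting(). A quick sanity check might look like this (a minimal sketch, using the same setting names as the script above):

import duckdb

# apply the same resource settings as the main script
duckdb.sql("SET memory_limit = '3GB';")
duckdb.sql("SET temp_directory = 'temp';")

# confirm DuckDB actually registered them
print(duckdb.sql("SELECT current_setting('memory_limit') AS memory_limit").fetchall())
print(duckdb.sql("SELECT current_setting('temp_directory') AS temp_directory").fetchall())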
BTW, I installed the latest version of DuckDB from PyPI.
root@localhost:~# pip install duckdb==0.10.2
What happens next?
You guessed it!
So what gives??
Who knows. Again, I see that there are many issues like this opened online: things working locally on someone’s laptop, but failing with OOM on servers, Linux ones of course.
I guess maybe DuckDB should just be relegated to the local laptop??? Not very promising for MotherDuck and its commercial hopes … when DuckDB still can’t run without OOMing on a simple Linux server … it seems like there is a lot of work to do before it can be taken seriously as a tool. Just saying.