Skip Lines of CSV files with DuckDB and Polars
There are some things you don’t need until you need them. I ran into that situation recently when I needed to process some CSV / flat files on short notice. At first it appeared to be easy, but then I realized that, as usual, a little monkey wrench had been thrown into the middle of it.
It is nothing earth-shattering. It’s just something that comes up so rarely that I forget there are ways to deal with these inconveniences without jumping through unnecessary hoops.
Skipping unwanted headers / lines in CSV or TXT files
The problem is pretty straightforward and more common than you might think, at least common enough to pop up for me in Data Engineering every 2 years or so: those aggravating leading lines in .TXT or .CSV files that sit where the headers should be but are not headers. They typically show up as dates, or some other nonsense, before the real header row and data actually begin.
Something like this …
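The original sample is not reproduced here, so as a stand-in, here is a made-up file of the same shape. The contents, the two junk lines, and the column names are all hypothetical, invented purely for illustration.

```python
# Hypothetical flat file: two junk lines, then the real header and data.
# Every value below is invented for illustration only.
sample_text = (
    "Export Date: 2024-01-15\n"
    "Source System: LEGACY\n"
    "id,name,amount\n"
    "1,alpha,10.5\n"
    "2,beta,20.0\n"
)

lines = sample_text.splitlines()
junk_lines = lines[:2]   # not part of the table at all
header_line = lines[2]   # the real column names
data_lines = lines[3:]   # the actual records

print(sample_text)
```

A naive CSV reader would treat the first line as the header, which is exactly the problem.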
Shall we all take a moment to go and strangle the little devils who thought this was a good idea (usually some COBOL programmer who made an export out of some mainframe system)? So annoying.
When you first run across files like this that you need to process, it’s easy to start jumping to all sorts of strange and wonderful programming incantations to solve the issue. I mean, that’s what we are good at, writing code, so it’s always fun to write more code, right?
Wrong.
You just have to pick the right tool and know the options it makes available to you. Most GOOD data tools provide some sort of option to skip X number of lines when reading files. A lifesaver.
DuckDB skipping CSV rows
Polars skipping CSV rows
Who knew it would be so easy? It’s the small things in life, and that’s the beauty in code, right? The problem with NOT knowing that you can skip lines in tools like DuckDB or Polars is that we end up writing extra code to do it by hand, and that extra code brings …
- additional complexity
- additional points of failure
- a larger codebase to maintain
It’s hard to know all things about all the tools that solve all problems all the time. It simply isn’t possible. BUT, there is something you should get into the habit of doing.
Reading the documentation.
When you run into a new problem that you aren’t familiar with, simply go to Google and look up the documentation for the method that applies to your problem. In this case … “… skipping records when reading a CSV file with tool XYZ.”
Simple things. Good tools.