Converting large CSV file to SQLite or splitting with Binary Stream

I have a large (1TB) CSV file that I need to convert to SQLite. When I try to use a Text Stream to convert it to SQLite, it works for a while, then dies with a buffer error after converting about 3 GB of data. I am reading one line, then saving that line to the SQLite database in a loop.

So my thinking on how to move forward is threefold.

A) I have not tried using a Binary Stream to look for the end of line, converting lines a chunk at a time. My idea is that a Binary Stream may be less taxing on the system. Please feel free to tell me if I am wrong.

B) The second approach is to read only so many lines, then take a break with a timer, so that only a limited number of lines are processed at a time. Since each line read is added to the SQLite database, I could also close the database after a chunk of reads, which may help somehow.

Looping (open the database, read a batch of lines from the CSV file, write them to the database, close the database, then reopen it) may help. All the while I need to keep the CSV file open to preserve the position of the read pointer. A rough sketch of what I mean is below, after option C.

C) The last option would be to open the input file, open a new output file, write a GB of data, close the output file, open a new output file, and continue. Basically, split the CSV file into GB-sized chunks, then do the import one file at a time. The theory here is that keeping the SQLite database open through all these writes fills a buffer used for rolling back transactions, and this is what is killing the process. So splitting the file would solve it.
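For option B, here is roughly what I have in mind. This is only a sketch: ImportTimer, CSVStream, DB, and the records table are made-up names, the stream and database are opened elsewhere and stay open between ticks, and the comma split is the naive kind that ignores quoted fields.

    ' In the Timer's Run (Action) event: read one batch per tick, then return
    ' so the app gets a break until the next tick fires.
    Const kLinesPerTick = 100000

    Var count As Integer
    While Not CSVStream.EndOfFile And count < kLinesPerTick
      Var rowText As String = CSVStream.ReadLine
      If rowText.Trim <> "" Then   ' skip blank lines
        Var fields() As String = Split(rowText, ",")   ' naive split, no quote handling
        DB.ExecuteSQL("INSERT INTO records (col1, col2) VALUES (?, ?)", fields(0), fields(1))
      End If
      count = count + 1
    Wend

    ' The database could also be closed and reopened here between batches.
    If CSVStream.EndOfFile Then
      Me.Enabled = False   ' all lines read; stop the timer
      CSVStream.Close
      DB.Close
    End If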

Normally I don’t need to mess with data files of this size, so this is a little new to me. Any suggestions would be great. It’s been decades since I needed to think about memory size limits in an application, but this large file is blowing up my past solution.

Thanks for your help

I wrote a CSV parser recently that reads a buffer into a string and then reads internally from that string to get lines and fields for processing (with Split and some additional code to handle quoted and multi-line fields). The optimal buffer size was around 100k. That approach may help on the read side.

As far as writing to SQLite goes, is there a reason you think it is the database connection? Depending on how you are adding data to SQLite, I think it commits as you go unless you are explicit about transactions.
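Something like this is what I mean by being explicit. It is only a sketch, assuming db is an already-connected SQLiteDatabase, input is an open TextInputStream over the CSV, and the records table and its columns are made up:

    ' Batch the INSERTs inside explicit transactions instead of letting SQLite
    ' commit every single row on its own.
    Const kRowsPerTransaction = 50000

    Var rowsInBatch As Integer
    db.BeginTransaction
    While Not input.EndOfFile
      Var rowText As String = input.ReadLine
      If rowText.Trim <> "" Then   ' skip blank lines
        Var fields() As String = Split(rowText, ",")   ' naive split, no quote handling
        db.ExecuteSQL("INSERT INTO records (col1, col2) VALUES (?, ?)", fields(0), fields(1))
        rowsInBatch = rowsInBatch + 1
        If rowsInBatch >= kRowsPerTransaction Then
          db.CommitTransaction   ' flush this batch and start a new one
          db.BeginTransaction
          rowsInBatch = 0
        End If
      End If
    Wend
    db.CommitTransaction

Keeping each transaction to a bounded number of rows also keeps the journal from growing without limit.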

If the goal is just to get the file into SQLite, you may also want to look at the tip here (SQLite is not a toy database | Anton Zhiyanov) under “Performance.”

(OFF) - the CSV file isn’t on a USB stick, is it?
I ask because I recently started seeing a couple of 64 GB drives I own slow waaaaayyyyy down after constant file access of over 20 minutes, as they started to get warm.
One hasn’t recovered.

No, the files (both database and CSV) are on SSD drives.

Not sure how to implement that in Xojo, but it’s worth looking into. If the IMPORT command is supported … why not let the database do the import on its own?

I’ll need to do some poking to find out how that may work.

Thanks Eric

Thought was, if you just have to import it once (or it doesn’t have to be part of your application), don’t bother with Xojo and just do it from the SQLite command line.
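In case it saves you some poking around, it looks roughly like this in the sqlite3 shell (database, file, and table names are made up; if the table doesn’t already exist, .import creates it using the first row as column names):

    sqlite3 bigdata.db
    sqlite> .mode csv
    sqlite> .import /path/to/huge.csv records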

I keep getting requests like this to add features for other clients from time to time.
That said, I did end up using the macOS split command, which is super simple:
split -l 1000000 filename
This will split the file every million LINES (so I don’t need to worry about handling line endings myself).
It’s similar to ReadLine with a text input stream in that it won’t break a line of data (record).
split -b would break on a number of bytes, and that WOULD split a record.
So the -l option works like ReadLine.

Now I just need to write a loop that opens each CSV file and imports that chunk to the SQLite database.
Closes both.
Then does it again until all the parts are processed.
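Roughly what I’m picturing, as a sketch only (the folder, database, table, and column names are placeholders, and the comma split still ignores quoted fields):

    ' Loop over the split parts: open the database and one part, import that
    ' part inside a transaction, close both, then move on to the next part.
    Var partsFolder As FolderItem = SpecialFolder.Documents.Child("csv_parts")
    Var dbFile As FolderItem = SpecialFolder.Documents.Child("bigdata.sqlite")

    For Each part As FolderItem In partsFolder.Children
      Var db As New SQLiteDatabase
      db.DatabaseFile = dbFile
      db.Connect   ' raises a DatabaseException on failure

      Var input As TextInputStream = TextInputStream.Open(part)
      db.BeginTransaction
      While Not input.EndOfFile
        Var rowText As String = input.ReadLine
        If rowText.Trim <> "" Then   ' skip blank lines
          Var fields() As String = Split(rowText, ",")   ' naive split, no quote handling
          db.ExecuteSQL("INSERT INTO records (col1, col2) VALUES (?, ?)", fields(0), fields(1))
        End If
      Wend
      db.CommitTransaction

      input.Close
      db.Close
    Next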

The CSV Virtual Table
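For reference, using it from the sqlite3 shell looks something like this (the csv module is a loadable extension, so it has to be available as a compiled library; the paths and table names here are made up):

    .load ./csv
    CREATE VIRTUAL TABLE temp.t1 USING csv(filename='/path/to/huge.csv');
    INSERT INTO records SELECT * FROM temp.t1;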

P.S.
Doing a “simple” split on a CSV file fails when a field contains the field separator character (a comma or anything else).

Right. So the first thing to remember is that any file-stream method you use needs enough RAM for the amount of data your app will hold in memory at any given time. Second, we can pretty safely assume that you don’t have over 1 TB of RAM unless you are working on a high-performance server, so you’ll need to read in chunks as you mentioned. Next, the threshold you reached is significant… 3 GB. The amount of data a single String or Text can hold is limited, and depending on the OS that limit could be right around this point.

My suggestion would be to start by using a TextInputStream to read each line, split it apart and add its data to the on-disk database. I clarify that because you’ll have the same problem with an in-memory database if it gets too large.

If that turns out to be too slow, you could use a BinaryStream to read data into RAM in chunks that are, let’s say, the size of 10 average lines and add them to a buffer. Then you split the buffer into lines, process the first n-1 lines, and put the last line back onto the buffer before reading and appending the next chunk. This lets you deal with not getting the whole last line. This would most likely be faster, but you’ll have to take more care with text encodings.
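Something along these lines, as a rough sketch only: it assumes UTF-8 text with LF line endings, uses a naive comma split with no quote handling, and the file, database, table, and column names are made up.

    ' Read fixed-size chunks with a BinaryStream, split on line feeds, process
    ' every complete line, and carry the trailing partial line over to the next
    ' chunk. Note that a multi-byte character can straddle a chunk boundary,
    ' which is the "more care with text encodings" caution above.
    Const kChunkSize = 1000000   ' roughly 1 MB per read

    Var csvFile As FolderItem = SpecialFolder.Documents.Child("huge.csv")
    Var db As New SQLiteDatabase
    db.DatabaseFile = SpecialFolder.Documents.Child("bigdata.sqlite")
    db.Connect

    Var input As BinaryStream = BinaryStream.Open(csvFile)
    Var leftover As String

    While Not input.EndOfFile
      Var chunk As String = leftover + input.Read(kChunkSize, Encodings.UTF8)
      Var rows() As String = Split(chunk, Chr(10))

      ' The last element may be an incomplete line; keep it for the next pass.
      leftover = rows(rows.LastIndex)
      rows.RemoveAt(rows.LastIndex)

      For Each row As String In rows
        If row.Trim <> "" Then   ' Trim also drops a trailing CR from CRLF files
          Var fields() As String = Split(row.Trim, ",")   ' naive split, no quote handling
          db.ExecuteSQL("INSERT INTO records (col1, col2) VALUES (?, ?)", fields(0), fields(1))
        End If
      Next
    Wend

    If leftover.Trim <> "" Then   ' final line if the file has no trailing newline
      Var fields() As String = Split(leftover.Trim, ",")
      db.ExecuteSQL("INSERT INTO records (col1, col2) VALUES (?, ?)", fields(0), fields(1))
    End If

    input.Close
    db.Close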

Just out of curiosity, which platform are you running your app on, is your app 32 or 64-bit and how much RAM do you have?

Oh, and you might squeeze a little more speed by compiling your app using Aggressive optimization if you’re using 64-bit.

Reading by line is very slow. I read one MB at a time, then split the data by EndOfLine. The last line remains in a buffer and is prepended to the next MB.

Exactly right – read a buffer. I wrote a parser that does that and takes an optional buffer size (to include unlimited) and then ran some experiments:

Buffer      5M/624MB   2M/249.6MB   1M/124.8MB   500k/62.4MB   100k/12.5MB
Unlimited   193        50.95        22.02        9.68          1.77
2M          84.78      33.4         16.88        8.4           1.68
1M          85         33.47        16.7         8.47          1.73
500k        84.13      33.73        16.65        8.45          1.7
100k        83.85      33.15        16.6         8.25          1.65
50k         83.56      33.37        16.8         8.25          1.73
10k         84.35      33.38        16.78        8.4           1.67

where the file sizes in the header are rows/bytes and the cells are runtimes in seconds. Interestingly, the optimal size is somewhere near 100k.

The other things to watch out for are quoted fields (which may contain the delimiter) and multiline fields (which have to be quoted).
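For what it’s worth, the field-splitting half of that can be handled with a small state machine instead of a plain Split. A trimmed-down sketch (it copes with delimiters inside quotes and with doubled quotes within a single record, but not with multi-line fields, which require the reader to keep appending lines until the quote count is even):

    ' Split one CSV record into fields, honoring quoted fields. A doubled
    ' quote ("") inside a quoted field becomes a literal quote character.
    Function ParseCSVRecord(record As String, delimiter As String = ",") As String()
      Var fields() As String
      Var current As String
      Var inQuotes As Boolean
      Var i As Integer
      Var recordLength As Integer = record.Length

      While i < recordLength
        Var ch As String = record.Middle(i, 1)
        If ch = """" Then
          If inQuotes And i + 1 < recordLength And record.Middle(i + 1, 1) = """" Then
            current = current + """"        ' escaped quote inside a quoted field
            i = i + 1
          Else
            inQuotes = Not inQuotes         ' opening or closing quote
          End If
        ElseIf ch = delimiter And Not inQuotes Then
          fields.Add(current)               ' unquoted delimiter ends the field
          current = ""
        Else
          current = current + ch
        End If
        i = i + 1
      Wend

      fields.Add(current)
      Return fields
    End Function

It walks the record character by character, so it is slower than Split, but it only needs to be applied to lines that actually contain a quote character.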