Reading large text files - inserting lines into a Database - best approach

The task I’m currently fiddling with is selecting a folder, cycling through all the files within it (text files), splitting each line of each file and inserting the split data into an SQLite database. At present there are 121 files; some of the larger files are around 150k lines, others around 40k. It takes a long time to process.

The process I’ve developed so far is like so:

1. Get folder where files reside
2. Determine the number of files (121)
3.    Get & open a file
4.        Read a line
5.            Analyse the line; insert the line parts into a database table
6.        Repeat until EOF (return to 4 - Read a line)
7.    Close the file
8. Repeat until the last file (return to 3 - Get & open a file)

The above doesn’t utilise any arrays (none that I created, anyway).
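In code, the loop is essentially this (simplified; AnalyseAndInsert is just a placeholder for the routine that splits a line and does the INSERT):

Dim folder As FolderItem = SelectFolder  // the chosen folder
If folder <> Nil Then
  For i As Integer = 1 To folder.Count
    Dim f As FolderItem = folder.Item(i)
    If f <> Nil And Not f.Directory Then
      Dim tis As TextInputStream = TextInputStream.Open(f)
      While Not tis.EOF
        Dim rowText As String = tis.ReadLine
        AnalyseAndInsert(rowText)  // placeholder: split the line and INSERT the parts
      Wend
      tis.Close
    End If
  Next
End If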

My question is: what would be the most efficient way to handle this process? So far I’ve only cycled through five or ten files at a time for testing purposes, not including the larger files, and it’s already taking a long time.

It’s become obvious that I need some form of progress indicator, at least to show that the process is still running and hasn’t fallen over.

What is the best way to determine how many lines a text file has? Is this inefficient to attempt in the first place?
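(The only approach I can think of is reading the whole file just to count the line breaks, something like the sketch below, which means reading each file twice:)

Dim f As FolderItem = GetOpenFolderItem("")  // one of the QIF files, for illustration
If f <> Nil Then
  Dim tis As TextInputStream = TextInputStream.Open(f)
  // normalise line endings first, then count the fields between them
  Dim lineCount As Integer = CountFields(ReplaceLineEndings(tis.ReadAll, EndOfLine), EndOfLine)
  tis.Close
End If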

For what it’s worth, I’m converting QIF text files that were exported out of Reckon Accounts (Quicken).

Any suggestions welcome.

If each file is not gigabytes in size, you may as well read the whole thing into a string and process it in memory.
Strip off header records like:

!Type:Bank

Then Split() the string on the ^ character to chop it into records, which will end up in an array.
Now process the array end to end, parsing each record either as it is or with a further Split() on end of line:

D03/03/10
T-379.00
PCITY OF SPRINGFIELD
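In rough code, that might look like this (sketch only; the actual field parsing is left as a comment):

Dim f As FolderItem = GetOpenFolderItem("")  // one QIF file, for illustration
If f <> Nil Then
  Dim tis As TextInputStream = TextInputStream.Open(f)
  Dim whole As String = ReplaceLineEndings(tis.ReadAll, EndOfLine)  // normalise line endings
  tis.Close

  Dim records() As String = Split(whole, "^")  // "^" terminates each QIF record

  For Each rec As String In records
    Dim fields() As String = Split(Trim(rec), EndOfLine)  // the D..., T..., P... lines
    For Each fld As String In fields
      If Left(fld, 1) = "!" Then Continue  // skip headers like !Type:Bank
      // otherwise parse by leading letter: D = date, T = amount, P = payee, ...
    Next
  Next
End If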

I would also try to read each file completely into memory, do a split, and store the result in the database with ONE insert statement for each file.
That would make 121 file ReadAlls and 121 inserts into the database.
This could go (much) faster.
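A sketch of that idea, with placeholder values (the table and column names here are made up; note that values built into the SQL string need escaping, and some SQLite versions cap the number of VALUES groups per statement):

Dim db As New SQLiteDatabase
db.DatabaseFile = GetFolderItem("import.sqlite")  // hypothetical database file
If db.Connect Then
  Dim rows() As String
  // in practice each row would come from a parsed QIF record
  rows.Append("('2010-03-03', -379.00, 'CITY OF SPRINGFIELD')")
  rows.Append("('2010-03-04', 1200.00, 'SALARY')")
  db.SQLExecute("INSERT INTO qif_import (txn_date, amount, payee) VALUES " + Join(rows, ", "))
  If db.Error Then
    MsgBox(db.ErrorMessage)
  End If
End If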

Thanks @Jeff Tullin and @Jean-Yves Pochez, it’s all so simple now. I’ve only tested against one small text file, but all is looking good at the moment.

I must admit, reducing the database inserts to a single SQL statement threw me, so I hit the Google machine for examples. I haven’t tested this part as yet; I’m hoping it’s possible to achieve with a prepared statement.
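Something like the following is what I’m hoping will work (untested sketch; table and column names are placeholders, and the hard-coded values stand in for the parsed record fields):

Dim db As New SQLiteDatabase
db.DatabaseFile = GetFolderItem("import.sqlite")  // hypothetical database file
If db.Connect Then
  Dim ps As SQLitePreparedStatement = _
    SQLitePreparedStatement(db.Prepare("INSERT INTO qif_import (txn_date, amount, payee) VALUES (?, ?, ?)"))
  ps.BindType(0, SQLitePreparedStatement.SQLITE_TEXT)
  ps.BindType(1, SQLitePreparedStatement.SQLITE_DOUBLE)
  ps.BindType(2, SQLitePreparedStatement.SQLITE_TEXT)
  // one SQLExecute per record; the statement itself is prepared only once
  ps.SQLExecute("2010-03-03", -379.00, "CITY OF SPRINGFIELD")  // placeholder values
End If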

What you do depends upon whether you want each text file to create records, or whether you just want to store the data.
And which of those you choose depends upon what you want to do with the text once it is in your database.

For example, if all you want to do is store the file so that someone can view it at a later time, then define the table with a BLOB (CLOB) field and, having read the text into a big string, push the string into the BLOB field using a prepared statement.
That’s stored, but the data is not information, as you can’t do any analysis on it.

If you expect to do maths on the individual records that the text file contained, then you need my version: read the whole string in, chop it up in memory, save the records into the database, and move on.
With records to play with you can query the database for ‘sum of bank transactions’, ‘sum of positive values’, etc.
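For example, something along these lines (sketch; assumes a hypothetical qif_import table with an amount column):

Dim db As New SQLiteDatabase
db.DatabaseFile = GetFolderItem("import.sqlite")  // hypothetical database file
If db.Connect Then
  Dim rs As RecordSet = db.SQLSelect("SELECT SUM(amount) AS total FROM qif_import WHERE amount > 0")
  If rs <> Nil Then
    Dim totalIn As Double = rs.Field("total").DoubleValue  // the 'sum of positive values' query
    rs.Close
  End If
End If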

[quote=329570:@Jeff Tullin]What you do depends upon whether you want each text file to create records, or whether you just want to store the data. […][/quote]

No, I’m not considering just storing the text files. This is about the fifth step in the conversion process; by this stage I’ve already converted the other QIF text files that relate to the Account List, Category List, Security List and Share Price history. The QIF files that are the subject of this thread are the transaction files, one QIF file per account, and they will be inserted into a single temporary table. All the tables at this stage are temporary; they will be deleted upon successful normalisation.

As regards multiple inserts in one SQL statement, it seems that may be problematic for SQLite, but what I did discover is that wrapping the inserts in BEGIN/COMMIT drastically improved their speed. Where a file was taking 50-60 seconds to load, with BEGIN/COMMIT it seemed to be instantaneous, so I’m happy with that.
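The change itself is tiny; the pattern is just this (sketch; the per-record INSERT code stays exactly as it was):

// db is assumed to be the already-connected SQLiteDatabase from the earlier sketches
Dim records() As String  // the split records for one file (filled elsewhere)

db.SQLExecute("BEGIN TRANSACTION")
For Each rec As String In records
  // the existing per-record INSERT (e.g. the prepared statement above) goes here, unchanged
Next
If db.Error Then
  db.SQLExecute("ROLLBACK")
Else
  db.SQLExecute("COMMIT")
End If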

I’ve had a few days to play around and tweak my code. Where I thought the bottleneck was processing the large text files line by line with ReadLine, that wasn’t the case. Keeping my code as it was (retaining the ReadLine approach) and merely wrapping the inserts in a BEGIN/COMMIT transaction drastically improved the speed.

As suggested by Jeff and Jean-Yves, I will still replace the ReadLine method: read each text file into a string, split it, and process it as an array, as that offers more options (a known record count, reading ahead).
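Roughly, the direction I’m heading in (sketch only; ProgressBar1 is a hypothetical control, and the per-record parsing and inserts are omitted):

Dim f As FolderItem = GetOpenFolderItem("")  // one QIF file, for illustration
If f <> Nil Then
  Dim tis As TextInputStream = TextInputStream.Open(f)
  Dim whole As String = ReplaceLineEndings(tis.ReadAll, EndOfLine)
  tis.Close

  Dim records() As String = Split(whole, "^")

  ProgressBar1.Maximum = UBound(records) + 1  // the count is known up front
  For i As Integer = 0 To UBound(records)
    // parse records(i) here; records(i + 1) is available for reading ahead (when i < UBound(records))
    ProgressBar1.Value = i + 1
  Next
End If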