Parsing a VERY large text file?

I have an app that processes data in plain vanilla text files. I’ve been parsing the data using this code:

Dim FileChunk As String
Dim FileData(-1) As String

FileChunk = textinput.ReadAll 'Read the data from the file
FileData = Split(FileChunk, EndOfLine.Windows) 'Split it into rows

The next step in the process is to go through the rows, searching for key words in each row and then depending on what’s found, handing off the contents of one or more subsequent rows into other processes.

This works fine with 99% of the files I process. However, I was recently handed one incredibly large file (331,847,701 bytes) that causes the Xojo runtime to crash, no doubt because there just isn’t enough memory available for that many bytes all at once.

Does anyone have ideas on how I can parse this file sequentially without using massive amounts of memory?

Thanks!

You can try to use the “ReadLine” method instead of the “ReadAll” method

Since you are processing a line at a time, why don’t you read the file a line at a time?

http://documentation.xojo.com/index.php/TextInputStream
http://documentation.xojo.com/index.php/TextInputStream.ReadLine
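A minimal line-at-a-time sketch (untested; the names are illustrative, and `f` is assumed to be the FolderItem for your data file):

```xojo
Dim tis As TextInputStream = TextInputStream.Open(f)
While Not tis.EOF
  Dim row As String = tis.ReadLine ' one row per iteration - only this line is held in memory
  ' search row for key words and hand off to other processes here
Wend
tis.Close
```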

I used to process the files using ReadLine, but it wound up being much slower than doing it all in memory. It would definitely be the simplest solution, though; maybe what I should do is check the file size and, if it’s super large, switch to reading a line at a time.

Thanks!

Read it in chunks, a couple of megabytes at a time…
It will require using a BinaryStream instead of a TextInputStream,
and you will have to do some extra checking for odd-sized chunks at the end of the file…

Holding it all in memory could require several copies at once, hence a very large memory footprint.

there are options

  1. do it a chunk at time by reading from disk, processing, writing to disk to minimize the amount of the file in memory at once

  2. read it into a sqlite db in memory (limits the number of copies you have) then process each “line” in the DB
    you may be able to do it all in sql which could make this quite quick depending on what processing you do
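A rough, untested sketch of option 2, assuming an open TextInputStream `tis` and a hypothetical table named `rows` (one record per line):

```xojo
Dim db As New SQLiteDatabase ' leaving DatabaseFile unset gives an in-memory database
If db.Connect Then
  db.SQLExecute("CREATE TABLE rows (id INTEGER PRIMARY KEY, line TEXT)")
  db.SQLExecute("BEGIN TRANSACTION") ' a single transaction keeps the inserts fast
  Dim ps As PreparedSQLStatement = db.Prepare("INSERT INTO rows (line) VALUES (?)")
  While Not tis.EOF
    ps.SQLExecute(tis.ReadLine) ' one record per line
  Wend
  db.SQLExecute("COMMIT")
  ' now process the lines in SQL, e.g. SELECT line FROM rows WHERE line LIKE '%keyword%'
End If
```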

I’d read, say, 1,000 lines at a time into an array, process them, read the next 1,000, and so on. If it takes too much time, I’d put this into a Timer or Thread so the UI will remain responsive.

Another idea: Read a chunk of data, rounded up to a whole line, process that, repeat. Something like this:

Dim chunk As String = tis.Read(1000000) ' read roughly 1 MB
chunk = chunk + tis.ReadLine ' then round up to the end of the current line

(I haven’t tried this to see what issues might arise if the initial read ends in the middle of a CRLF combo, or any other issues.)

300 MB isn’t that large. You should be able to read it all, split it into an array, empty the original variable, process a line into a database, remove the line, etc.

If you don’t empty the variables that hold it temporarily then yes, you end up with multiple copies and can run into memory problems.

P.S. You might want to use Thomas Tempelmann’s multi-processing template to split the work into separate helper apps that use all the available cores on your computer.

http://www.tempel.org/RB/MultiProcessing

This will also help with the memory problem.

ReadLine is going to be slow. I would read a chunk (via BinaryStream per Dave) and split that on EndOfLine. The last element of the resulting array is going to be the “leftovers” - that part of the chunk that was a partial line. Back up the stream’s location by the length of that element before doing the next read.

And be aware that BinaryStream deals in bytes, not characters. Code accordingly.
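That approach might look something like this (an untested sketch; the chunk size, variable names, and Windows line endings are assumptions carried over from the original code):

```xojo
Dim bs As BinaryStream = BinaryStream.Open(f, False) ' False = open read-only
While Not bs.EOF
  Dim rows() As String = Split(bs.Read(1000000), EndOfLine.Windows)
  If Not bs.EOF Then
    ' the last element is a partial line - back the stream up so the next Read starts there
    Dim leftover As String = rows(rows.Ubound)
    rows.Remove(rows.Ubound)
    bs.Position = bs.Position - leftover.LenB ' LenB, since BinaryStream counts bytes
  End If
  ' process rows() here
Wend
bs.Close
```

Real code would also want to guard against a single line longer than the chunk size, which would keep the position from ever advancing.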

331 megs doesn’t seem like all that large of a text file, but reading it in chunks is a pretty good way to go.

I’m not sure how Xojo threads* work, but you might want to have a “chunk” reader in a thread that services your main program. Then you can muck about with the sizes and such without affecting your main logic.

-Paul

*If you are on Mac or Linux, you can simply have two processes doing that work, and use a pipe or shared memory area.

-Paul

It can be done but is a pain in the rear

(grin) It is a Unix / Mainframe mindset, I guess. Xojo does have Stdin and Stdout, though I have not tried to use them except in console apps, where they work just fine.

Or, he could just mmap the file and not read it at all… :slight_smile:

-Paul

mmap requires declares to get at the functions

again - possible - just not trivial to do and do right

These are all terrific ideas, I’m going to give most of them a try and see which one works best for this application.

Thanks!!!

  • John

300MB isn’t really large. When Xojo crashes on such a file, it’s usually your code :slight_smile: What I once managed to do was have a single 32-bit integer among 64-bit integers, creating an overflow. Xojo then crashed when reading a negative number of bytes.

Please post your crash log if your crashes persist.

[quote=201613:@Beatrix Willius]300MB isn’t really large. When Xojo crashes on such a file, it’s usually your code :slight_smile: What I once managed to do was have a single 32-bit integer among 64-bit integers, creating an overflow. Xojo then crashed when reading a negative number of bytes.

Please post your crash log if your crashes persist.[/quote]

There really isn’t any code other than what I posted. If the crash persists using other methods, I’ll definitely post the crash log here and probably file a bug report, too.

Thanks -

Out of curiosity: what are the files? I’m dealing with proteomics files in my app …

[quote=201554:@John McKernon]I have an app that processes data in plain vanilla text files. I’ve been parsing the data using this code:

Dim FileChunk As String
Dim FileData(-1) As String

FileChunk = textinput.ReadAll 'Read the data from the file
FileData = Split(FileChunk, EndOfLine.Windows) 'Split it into rows
[/quote]

I tried to reproduce your error. I could not get a crash until I got to a 1.8 GB file.

However, in doing so, I spotted a potential space saver:

Dim FileData(-1) As String
FileData = Split(textinput.ReadAll, EndOfLine.Windows) 'Split it into rows

You save the room needed for FileChunk.

I tried Michel’s suggestion for saving space, but the Split command still crashes. My guess is there’s something funky in the content of the file, but I haven’t been able to discover what.

So I extracted the problematic method into a small sample project and filed a Feedback report, it’s #40191.

(FWIW, the file contains lighting information on a Cirque production in Las Vegas, it’s just a bunch of text that’s useless to anyone who doesn’t know how to read it.)

Thanks everyone!