Parsing a VERY large text file?

Maybe you can make the file available so others can have a look at it?

FYI, you can use the icon at the bottom of the Feedback report to copy a direct link, then paste that link, unmodified, here.

For everyone’s convenience:

<https://xojo.com/issue/40191>

Also, it turns out that your zipped file will be zipped again, so don’t zip it first. :slight_smile:

Ah! I learn something every day!

I just wanted to make sure the .txt file wasn’t corrupted by whatever process they use to upload it.

The key term in Michel’s post was potential space-saver. I doubt that it saves any space at all, as the string still has to be read into (temporary) memory in order to split it.

Have you tried taking it in smaller chunks? A few MB at a time?

The Mac decompresses both in one pass.

I just tried loading the text file attached to the bug report. It does crash here, in a project I was not able to crash with files under 1.8 GB.

Since the end of line is Windows (CRLF), I tried to load it in Word on Windows, which I know can load files of any size. It generates an error.

I strongly suspect that file size is not the only issue, but maybe some control characters.

Michel, I think the problem is with the resulting array size. There are many, many lines that contain just “0”, so the array gets extremely large.

I read the entire thing and Split() choked. But I was able to read it in 1MB chunks and split each one with no problem.

Were you able to come up with one big array with all the lines? It looks as if we may have found the UBound limit…

I’ll try that.

I tried appending to one large array and it crashed. The funny thing is, it ran out of memory. If I simply read the file in 1MB chunks, without splitting, the app uses 6MB total. If I split each chunk, but do nothing else, it stabilizes at 27MB of memory. If I try to append the elements of each chunk’s array into one big array, it hits 2.6GB and the system becomes unresponsive until it crashes. It gets up to about 52,000,000 elements in the array, and about 290MB of the file. After that, I don’t know what goes on. I’ll do some more testing.

It looks like each 1MB from the file, split and then appended to one main array, adds about 10MB to the memory usage of the app. So a 320MB file is going to max out memory for a 32-bit app on Windows.
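
For reference, a minimal sketch of the kind of chunked test I mean (the names are illustrative, and this naive version will split any line that happens to straddle a chunk boundary, which is fine for a memory test):

[code]' Rough sketch only: read 1MB at a time, split each chunk, append everything to one array
Dim f As FolderItem = GetOpenFolderItem("") ' pick the big .txt file
If f = Nil Then Return

Dim bs As BinaryStream = BinaryStream.Open(f, False)
Dim allLines() As String
Dim before As UInt64 = Runtime.MemoryUsed

Dim chunk As String
Dim pieces() As String
While Not bs.EOF
  chunk = bs.Read(1024 * 1024) ' roughly 1MB per pass
  pieces = Split(chunk, EndOfLine.Windows) ' the file uses CRLF
  For i As Integer = 0 To UBound(pieces)
    allLines.Append(pieces(i))
  Next
Wend
bs.Close

System.DebugLog(Str(UBound(allLines) + 1) + " elements, ~" + Str(Runtime.MemoryUsed - before) + " bytes of growth")[/code]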

I changed it up and just appended a 20-byte string to an array. When it hit 52,000,000 elements in the array, the app was only using 209MB of memory. There seems to be something about Split() here.

My guess is that we are witnessing the RAM limitations of 32-bit and trying this in 64-bit would not lead to a crash. It might take an awfully, awfully long while, so long that you might consider gouging your eyes out with a hairpin just to pass the time, and would eventually force-quit it just to help roll back your blood pressure, but it wouldn’t crash.

At least, that’s how I imagine it would be in a world without NDAs…

This was interesting. I scanned the file to check for any odd characters causing problems, and it is just a normal Unicode UTF-8 text file with CRLF line terminators in it.

On my machine, Xojo will fail out around 54,842,500 records.

I modified the program to use TextInput.ReadLine and just appended each line to the String array. That made things even more interesting, as the exact point it would crash was variable, from 54,831,003 records to 54,842,500.

That led me to believe that the issue is related somehow to the way Xojo is allocating memory, and I believe that may be exactly the case. It looks to me (attaching to the program with gdb and getting all confused…) that Xojo is reallocating the entire in-memory array each time it needs more space for it. That may turn out not to be true, however; I didn’t go after this professionally, just for the fun and challenge of it. :slight_smile:
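
If repeated reallocation really is the culprit, one cheap experiment (just a sketch, not something I’ve verified) would be to pre-size the array once with Redim and assign by index, so the framework never has to grow it:

[code]' Hypothetical experiment: pre-allocate the array once so it never has to be grown
Dim theLines() As String
Redim theLines(60000000) ' guess high; about 60 million slots

Dim i As Integer
While Not tis.EOF ' tis: an already-open TextInputStream on the big file (assumed)
  theLines(i) = tis.ReadLine
  i = i + 1
Wend

Redim theLines(i - 1) ' trim off the unused tail[/code]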

Oh, I would suggest writing the data out to a new file, or even to a small database. It would be a little more time-consuming, but it would also work and give you access to the information in the file.
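
Something along these lines, as a minimal sketch (the file name, table layout, and the theFile reference are illustrative, not from an actual project):

[code]' Hedged sketch: stream the lines into a small SQLite database instead of one huge array
Dim db As New SQLiteDatabase
db.DatabaseFile = SpecialFolder.Desktop.Child("lines.sqlite")
If Not db.CreateDatabaseFile Then
  MsgBox("Could not create database: " + db.ErrorMessage)
  Return
End If

db.SQLExecute("CREATE TABLE IF NOT EXISTS lines (id INTEGER PRIMARY KEY, value TEXT)")

Dim t As TextInputStream = TextInputStream.Open(theFile) ' theFile: the big .txt FolderItem (assumed)
Dim ps As SQLitePreparedStatement = SQLitePreparedStatement(db.Prepare("INSERT INTO lines (value) VALUES (?)"))
ps.BindType(0, SQLitePreparedStatement.SQLITE_TEXT)

db.SQLExecute("BEGIN TRANSACTION") ' one transaction keeps tens of millions of inserts reasonably quick
While Not t.EOF
  ps.SQLExecute(t.ReadLine)
Wend
db.SQLExecute("COMMIT")
t.Close[/code]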

Debug thread below. Interested to hear if anyone else can verify it, should they have the interest and time. Also, the folks here with much better familiarity with Xojo than I have may ferret out another reason, and this may turn out to be a red herring. :slight_smile:

Yours,
-Paul

[code]Crashed Thread: 0 Dispatch queue: com.apple.main-thread

Exception Type: EXC_CRASH (SIGABRT)
Exception Codes: 0x0000000000000000, 0x0000000000000000

Application Specific Information:
abort() called
terminating with uncaught exception of type std::bad_alloc: std::bad_alloc

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libsystem_kernel.dylib 0x9c45c69a __pthread_kill + 10
1 libsystem_pthread.dylib 0x95c4ff19 pthread_kill + 101
2 libsystem_c.dylib 0x9bb88eee abort + 156
3 libc++abi.dylib 0x923622f9 abort_message + 169
4 libc++abi.dylib 0x92385483 default_terminate_handler() + 272
5 libc++abi.dylib 0x92382ac0 std::__terminate(void (*)()) + 14
6 libc++abi.dylib 0x923824db __cxa_throw + 122
7 libc++.1.dylib 0x90dc1ac6 operator new(unsigned long) + 102
8 com.xojo.XojoFramework 0x002c2fca 0x143000 + 1572810
9 com.mckernon.splitcrashesapp 0x000f7d2b Window1.Window1.SplitTheFile%%o<Window1.Window1>o + 1020
10 com.mckernon.splitcrashesapp 0x000f786f Window1.Window1._OpenItem_Action%b%o<Window1.Window1> + 257[/code]

The error here was the fact that I was using the same string. Xojo can conserve space by copying a reference to the string into the array, not the entire string. (Strings are immutable, so it is valid to do so.) I changed it to constructing a 20-byte string, so each one was a unique string in memory, and the memory use shot up.
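
To make the difference concrete, this is roughly the shape of the two tests (a sketch rather than the exact project code, using Runtime.MemoryUsed as the yardstick):

[code]Dim arr() As String
Dim s As String = "aaaaaaaaaaaaaaaaaaaa" ' one 20-byte string

' Version 1: append the same string over and over.
' Only a reference is stored each time, so growth is mostly the array itself.
Dim before As UInt64 = Runtime.MemoryUsed
For i As Integer = 1 To 1000000
  arr.Append(s)
Next
System.DebugLog("Shared string: +" + Str(Runtime.MemoryUsed - before) + " bytes")

' Version 2: construct a unique 20-byte string each time.
' Every element now owns its own string object, so the total is many times larger.
Redim arr(-1)
before = Runtime.MemoryUsed
For i As Integer = 1 To 1000000
  arr.Append(Right("00000000000000000000" + Str(i), 20))
Next
System.DebugLog("Unique strings: +" + Str(Runtime.MemoryUsed - before) + " bytes")[/code]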

So the issue has nothing to do with Split(). It’s purely the fact that an array of strings takes a lot more space than the actual strings themselves.

For short strings, such as in the OP’s file, the array overhead becomes severe.
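
As a very rough back-of-the-envelope (the per-string overhead figure is a guess, not a documented number): with 4 bytes per array slot for the reference on 32-bit plus something like 30–50 bytes of per-string object overhead, 54,000,000 one-character lines come out to roughly 2–3 GB, which is at least in the same ballpark as the ~2.6 GB observed just before the crash.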

Read your file in pieces. Here is a part of the code that I use to successfully read 10 GB files:

[code]do
  'check how much data needs to be read
  if Globals.StopArchiving then return 0 'parsing cancelled by user
  dim FilePosition as Int64 = inputBinary.Position - LenB(lastLine)
  LeftToRead = min(1 * 1024 * 1024, FileLength - FileAlreadyRead)
  FileAlreadyRead = FileAlreadyRead + LeftToRead
  if FileAlreadyRead = FileLength then FileIsRead = true
  if FileIsRead then
    mboxData = lastLine + inputBinary.Read(LeftToRead)
  else
    mboxData = lastLine + inputBinary.Read(LeftToRead - ReadLeftOver)
    'read a little extra and split it at the line break
    dim theRight as String = inputBinary.Read(ReadLeftOver)
    dim theRightSplit as Pair = SplitLine(theRight)
    lastLine = theRightSplit.Left
    mboxData = mboxData + theRightSplit.Right
  end if

  'now do something with the result
loop until FileIsRead[/code]

The principle is to read 1MB and then the rest of a line until I get one of the EndOfLine characters. This is for reading mbox files.
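
SplitLine isn’t shown above; purely as an illustration of that idea (a hypothetical helper, not the original code), the first CRLF in the extra bytes closes out the current chunk, and whatever follows it is carried over as lastLine for the next pass:

[code]' Hypothetical helper, not the original: split a small tail of the chunk at the first CRLF.
' .Right completes the current chunk; .Left is carried over to the next read as lastLine.
Function SplitLine(s As String) As Pair
  Dim pos As Integer = InStrB(s, EndOfLine.Windows)
  If pos = 0 Then Return New Pair(s, "") ' no line break found: carry everything forward
  Return New Pair(MidB(s, pos + 2), LeftB(s, pos + 1))
End Function[/code]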

[quote=201815:@Paul Raulerson]On my machine, Xojo will fail out around 54,842,500 records.

I modified the program to use TextInput.ReadLine and just appended each line to the String array. That made things even more interesting, as the exact point it would crash was variable, from 54,831,003 records to 54,842,500.[/quote]

I saw just the same here. It crashes the same way with ReadLine.

It does indeed seem there is a practical limit to how many elements a string array can have.

Coming back to the OP’s question:

I have no idea how you are using/displaying the data, but Norman’s idea of putting the data into an SQLite database seems the most robust, if not the fastest one.

Apart from that, I would just try cutting the big string into parts to parse and use page by page.

Wow, I had no idea how complicated this would turn out to be. Xojo has confirmed the problem, but if the problem is indeed the way string arrays are stored, then there’s not likely to be a real solution.

Although this chunk of data was an anomaly, it looks like I’ll be moving my storage to a DB, which may create speed issues, but at least the whole app won’t come crashing down.

Thank you all for digging into this, you’re an amazing group of people!