Chunk Algorithm Question

Martin_T · January 5, 2018, 11:00pm

Hey everyone,

I have the following code:

[code]Using Xojo.IO

Dim f As FolderItem = SpecialFolder.Desktop.Child("test.txt")
Dim b As BinaryStream = BinaryStream.Open(f, BinaryStream.LockModes.Read)
Dim enc As TextEncoding = GetTextEncoding(b)

Dim txt As Text = b.ReadText(b.Length, enc)

// here should the chunk be inserted

b.Close[/code]
I have a very large text file. I want to read the file in chunks and then split each chunk into an array of lines. It is important that every chunk ends at the end of a line. So I have to find out whether b.ReadText(xxx, enc) is followed by EndOfLine bytes. Otherwise xxx needs to grow until one position before the next EndOfLine byte. How can I do this? In other words, I need a dynamic read buffer size.

Robert_Weaver · January 5, 2018, 11:49pm

What I generally do is read the fixed size chunk without regard to where the line endings occur, and then use either the split() or countFields() and nthField() functions (using the EndOfLine as delimiter) to process the text. The last line is assumed to be incomplete, and so it is not processed, but saved and prefixed to the start of the next chunk to be read. I should have some code lying around somewhere that I can post.

Martin_T · January 5, 2018, 11:55pm

Hi Robert, this sounds like what I'm looking for. I'd be glad if you could find a snippet.

Robert_Weaver · January 6, 2018, 12:17am

The only example I could find was quite messy, so I’ve recoded it here but haven’t tested it. So some debugging may be necessary.

[code]Public Sub ReadChunks(txtIn as TextInputStream)
  Const chunkSize=5000
  dim residual As String = ""
  While not txtIn.EOF
    dim tArray() As string = split(txtIn.Read(chunkSize),EndOfLine)
    tArray(0)=residual+tArray(0) 'insert residual text of incomplete line from previous read
    dim nLines As Integer = UBound(tArray)-1
    residual=tArray(nLines+1) 'save incomplete line at end of current chunk for next read
    for i As integer = 0 to nLines
      'Process complete lines of text here
    Next
  Wend
  'If file doesn't end with an EndOfLine, then the residual string
  'will contain the final partial line which still needs to be processed.
  if residual<>"" then
    'Process final partial line of text
  end if
End Sub[/code]

Martin_T · January 6, 2018, 12:58am

The version you posted works perfectly. But after translating it to the new framework I got an exception:

As you can see, I like to use the new Xojo Framework.

Thank you

Robert_Weaver · January 6, 2018, 1:05am

I don’t use the new framework, because I’ve had too many problems with it when dealing with strings and text. So, I won’t be much help in locating the problem. Hopefully, someone else will chime in.

There is one other point that I should mention. The above code should work fine on Mac and Linux, but because Windows uses a two character line ending, you could get unexpected behaviour on Windows if the file read happens to break halfway through the line ending (i.e., the read includes the first character of the line ending, but not the second). What I would do in that case, is not use EndOfLine as the delimiter, but chr(10), linefeed, instead, and then do a ReplaceAll() on each line of text to delete the carriage return characters: chr(13).
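A minimal sketch of the linefeed approach described above, using the classic-framework functions Split, ReplaceAll and Chr. The helper name SplitChunkOnLF is hypothetical and not part of the code posted earlier. Note that a file with bare carriage-return line endings would not be split at all by this approach.

```xojo
' Hypothetical helper: splits a chunk on linefeed and strips carriage returns,
' so a CR/LF pair cut in half by a chunk boundary cannot create a false line break.
Function SplitChunkOnLF(chunk As String) As String()
  Dim parts() As String = Split(chunk, Chr(10))
  For i As Integer = 0 To UBound(parts)
    parts(i) = ReplaceAll(parts(i), Chr(13), "")
  Next
  Return parts
End Function
```

The last element of the returned array is still the possibly incomplete line, so it would be carried over to the next read exactly as before.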

Martin_T · January 6, 2018, 1:12am

I'll wait and see whether someone else can help. Here is the new framework version:

[code]Public Sub ReadChunks(txtIn As Xojo.IO.TextInputStream)
  Const chunkSize = 5000
  Dim residual As Text = ""
  Dim a As Integer = 1

  While Not txtIn.EOF
    Dim tArray() As Text = txtIn.Read(chunkSize).Split(&uA) ' IO Error
    ' insert residual text of incomplete line from previous read
    tArray(0) = residual + tArray(0)
    Dim nLines As Integer = tArray.Ubound - 1
    ' save incomplete line at end of current chunk for next read
    residual = tArray(nLines + 1)

    For i As Integer = 0 To nLines
      ' Process complete lines of text here
      a = a + 1
    Next

    MsgBox residual + " " + a.ToText
  Wend
End Sub[/code]

Robert_Weaver · January 6, 2018, 1:22am

Do you know for sure that the exception is occurring on this line?

Dim tArray() As Text = txtIn.Read(chunkSize).Split(&uA) ' IO Error

Also, note that I edited my original code to account for the situation where the file doesn’t have an EndOfLine as the last character: I added the part after the Wend statement, but that shouldn’t affect the error that you’re getting.

Martin_T · January 6, 2018, 1:35am

I don't know, but that's the line the debugger jumps to (message: Error reading).

Robert_Weaver · January 6, 2018, 1:42am

Well, that would seem to confirm that to be the bad line. I wonder if it’s somehow trying to read past the end of file, and that’s throwing the exception. Do your diagnostics show how many file reads occurred before the exception?

When you said that my code worked, does that mean that you tested initially using the old framework?

BTW, The following revision should eliminate the Windows two character line ending issue, and allow you to use the proper EndOfLine rather than &uA, making it more reliable across different platforms. This is still old framework though.

[code]Public Sub ReadChunks(txtIn as TextInputStream)
  Const chunkSize=5000
  dim residual As String = ""
  While not txtIn.EOF
    'residual string is prefixed to front of txtIn.Read
    'which restores the possibly split Windows EndOfLine
    dim tArray() As string = split(residual+txtIn.Read(chunkSize),EndOfLine)
    dim nLines As Integer = UBound(tArray)-1
    residual=tArray(nLines+1) 'save incomplete line at end of current chunk for next read
    for i As integer = 0 to nLines
      'Process complete lines of text here
    Next
  Wend
  'If file doesn't end with an EndOfLine, then the residual string
  'will contain the final partial line which needs to be processed.
  if residual<>"" then
    'Process final incomplete line of text here
  end if
End Sub[/code]

Martin_T · January 6, 2018, 1:57am

After analyzing it, it looks like the error appears during the last loop iteration. Weird!

Yes, I tested it with the old framework first.

I think you can optimize it a little by using ReDim for the arrays.

[code]Public Sub ReadChunks(txtIn as TextInputStream)
  Const chunkSize=5000
  dim residual As String = ""
  dim tArray() As String
  While not txtIn.EOF
    'residual string is prefixed to front of txtIn.Read
    'which restores the possibly split Windows EndOfLine
    tArray = split(residual+txtIn.Read(chunkSize),EndOfLine)
    dim nLines As Integer = UBound(tArray)-1
    residual=tArray(nLines+1) 'save incomplete line at end of current chunk for next read
    for i As integer = 0 to nLines
      'Process complete lines of text here
    Next
  Wend
  'If file doesn't end with an EndOfLine, then the residual string
  'will contain the final partial line which needs to be processed.
  if residual<>"" then
    'Process final incomplete line of text here
  end if
  ReDim tArray(-1)
End Sub[/code]

Robert_Weaver · January 6, 2018, 2:01am

Looks like it’s trying to read past the end of file then. This shouldn’t throw an exception though. It should just return zero characters. Maybe a Xojo bug then.
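One way to confirm which read fails is to count the reads and log the count when an exception is raised. The following is only a rough diagnostic sketch: the method name ReadChunksWithCount is hypothetical, and it assumes the exception raised by Read can be caught as a RuntimeException.

```xojo
' Hypothetical diagnostic wrapper: counts reads and logs which one fails.
Sub ReadChunksWithCount(txtIn As Xojo.IO.TextInputStream)
  Const chunkSize = 5000
  Dim readCount As Integer = 0
  Try
    While Not txtIn.EOF
      readCount = readCount + 1
      Dim chunk As Text = txtIn.Read(chunkSize)
      ' Split and process the chunk here
    Wend
    System.DebugLog("All " + Str(readCount) + " reads succeeded")
  Catch e As RuntimeException
    ' Assumption: the read error is a RuntimeException subclass
    System.DebugLog("Read #" + Str(readCount) + " raised an exception")
    Raise e ' pass the exception on so the debugger still sees it
  End Try
End Sub
```

Comparing the logged read number with the file length divided by chunkSize would show whether the failure really happens on the final, partial chunk.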

Robert_Weaver · January 6, 2018, 2:12am

Checking Xojo.IO.TextInputStream in the Language reference, it gives two possible causes of an exception:

IOException - If there is not enough memory available or the stream is not open.

Obviously, the stream must be open or it wouldn’t get to the last loop.
I wonder if repeatedly assigning the new input to tArray is somehow not allowing the program to free up the memory used by the array. Maybe it would be a good idea to dimension the array at the beginning of the code, and then put a redim immediately before the input read. Like this:

[code]Public Sub ReadChunks(txtIn as TextInputStream)
  Const chunkSize=5000
  dim tArray() As string
  dim residual As String = ""
  While not txtIn.EOF
    'residual string is prefixed to front of txtIn.Read
    'which restores the possibly split Windows EndOfLine
    redim tArray(-1)
    tArray = split(residual+txtIn.Read(chunkSize),EndOfLine)
    dim nLines As Integer = UBound(tArray)-1
    residual=tArray(nLines+1) 'save incomplete line at end of current chunk for next read
    for i As integer = 0 to nLines
      'Process complete lines of text here
    Next
  Wend
  'If file doesn't end with an EndOfLine, then the residual string
  'will contain the final partial line which needs to be processed.
  if residual<>"" then
    'Process final incomplete line of text here
  end if
End Sub[/code]

Martin_T · January 6, 2018, 2:26am

Still the same error. But Robert, I'm really amazed by the speed of your algorithm. Wow, I was never able to read a 665.7 MB file before. OK, at this point I haven't parsed anything, but when I read the whole file from a TextInputStream and split it into an array of lines, my system froze completely and I had to do a hard restart of my Mac. Looks like chunks are a good thing.

Robert_Weaver · January 6, 2018, 2:46am

Then this looks like a possible Xojo bug. It would be good to get some input from someone with experience using Xojo.IO.TextInputStream though.

Markus_Winter · January 6, 2018, 12:43pm

A 700 MB text isn’t THAT big. Why not read it in in one go and split it at the line endings?

Martin_T · January 6, 2018, 2:47pm

Hello Markus,
You're right. However, in my experience, splitting the entire TextInputStream into an array of lines put a very heavy load on the memory and CPU of my old 2009 MacBook Pro. I didn't have enough resources left to parse each line, because the parsing creates its own large object structure. This froze the operating system completely, and the only fix was to power the MacBook off and on manually. Not the nice kind.

It was 64-bit.

Markus_Winter · January 6, 2018, 2:52pm

Was that with 32bit or 64bit though?

Also try a small file with just a few lines and a small chunk size first. Do you still see the error? If yes then it is not the size.
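A small self-contained test along these lines might look like the sketch below, using the classic framework. The method name TestSmallChunks and the file name chunktest.txt are made up for illustration, and the chunk size of 5 is chosen so that reads break in the middle of a line.

```xojo
' Hypothetical test: three short lines, read back 5 characters at a time.
Sub TestSmallChunks()
  Dim f As FolderItem = SpecialFolder.Temporary.Child("chunktest.txt")
  Dim out As TextOutputStream = TextOutputStream.Create(f)
  out.Write("one" + Chr(10) + "two" + Chr(10) + "three")
  out.Close

  Dim txtIn As TextInputStream = TextInputStream.Open(f)
  Dim residual As String = ""
  Dim lineCount As Integer = 0
  While Not txtIn.EOF
    Dim tArray() As String = Split(residual + txtIn.Read(5), Chr(10))
    residual = tArray(UBound(tArray)) ' last element may be an incomplete line
    lineCount = lineCount + UBound(tArray) ' every other element is a complete line
  Wend
  txtIn.Close

  ' The file has no trailing linefeed, so the final line is still in residual
  If residual <> "" Then lineCount = lineCount + 1
  System.DebugLog("Lines read: " + Str(lineCount))
End Sub
```

If the residual handling is right, the logged count should match the three lines written. Running the same loop against Xojo.IO.TextInputStream would show whether the error depends on file size at all.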

Markus_Winter · January 6, 2018, 3:05pm

Also “The last line is assumed to be incomplete” might not be true for all chunks.
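A chunk that happens to end exactly on a line ending may still be handled correctly by the residual logic, on the assumption that Split returns an empty final element when the source ends with the delimiter. That assumption is worth checking rather than trusting; a quick check could look like this (the method name is hypothetical):

```xojo
' Assumes Split keeps an empty trailing field when the source ends with the delimiter.
Sub CheckChunkEndingOnLineBreak()
  Dim chunk As String = "alpha" + EndOfLine + "beta" + EndOfLine
  Dim tArray() As String = Split(chunk, EndOfLine)
  Dim residual As String = tArray(UBound(tArray))
  If residual = "" Then
    System.DebugLog("Chunk ended on a line break; nothing is carried over")
  Else
    System.DebugLog("Carried over: " + residual)
  End If
End Sub
```

If the first message is logged, the "last line" is simply an empty string in that case, and prefixing it to the next chunk changes nothing.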

Martin_T · January 6, 2018, 3:10pm

You mean for the new Xojo framework translation? I tried it with a small chunk size of 50, and every time I got the error message for the line containing txtIn.Read(chunkSize).Split(&uA).