Dim f As FolderItem = SpecialFolder.Desktop.Child("test.txt")
Dim b As BinaryStream = BinaryStream.Open(f, BinaryStream.LockModes.Read)
Dim enc As TextEncoding = GetTextEncoding(b)
Dim txt As Text = b.ReadText(b.Length, enc)
// the chunked read should go here
b.Close[/code]
I have a very large text file. I want to read the file in chunks and then split each chunk into an array of lines. But it’s important that each chunk always ends at the end of a line. So I have to find out whether b.ReadText(xxx, enc) will be followed by EndOfLine bytes. Otherwise xxx needs to grow until one position before the next EndOfLine byte. How can I do this? In other words, I need a dynamic read buffer size.
What I generally do is read the fixed size chunk without regard to where the line endings occur, and then use either the split() or countFields() and nthField() functions (using the EndOfLine as delimiter) to process the text. The last line is assumed to be incomplete, and so it is not processed, but saved and prefixed to the start of the next chunk to be read. I should have some code lying around somewhere that I can post.
The only example I could find was quite messy, so I’ve recoded it here, but I haven’t tested it, so some debugging may be necessary.
[code]Public Sub ReadChunks(txtIn as TextInputStream)
  Const chunkSize=5000
  dim residual As String = ""
  While not txtIn.EOF
    dim tArray() As string = split(txtIn.Read(chunkSize),EndOfLine)
    tArray(0)=residual+tArray(0) 'insert residual text of incomplete line from previous read
    dim nLines As Integer = UBound(tArray)-1
    residual=tArray(nLines+1) 'save incomplete line at end of current chunk for next read
    for i As integer = 0 to nLines
      'Process complete lines of text here
    Next
  Wend
  'If file doesn't end with an EndOfLine, then the residual string
  'will contain the final partial line which still needs to be processed.
  if residual<>"" then
    'Process final partial line of text
  end if
End Sub[/code]
I don’t use the new framework, because I’ve had too many problems with it when dealing with strings and text. So, I won’t be much help in locating the problem. Hopefully, someone else will chime in.
There is one other point that I should mention. The above code should work fine on Mac and Linux, but because Windows uses a two character line ending, you could get unexpected behaviour on Windows if the file read happens to break halfway through the line ending (i.e., the read includes the first character of the line ending, but not the second). What I would do in that case, is not use EndOfLine as the delimiter, but chr(10), linefeed, instead, and then do a ReplaceAll() on each line of text to delete the carriage return characters: chr(13).
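For anyone who wants to check that linefeed workaround outside Xojo, here is a minimal Python sketch of the same logic. It is only an illustration under assumptions: the function name `read_lines_lf` and the use of `io.StringIO` as a stand-in for the input stream are made up for this example and are not part of any Xojo API.

```python
import io

def read_lines_lf(stream, chunk_size=5000):
    """Yield lines from a text stream that is read in fixed-size chunks.

    Splits on linefeed only and strips carriage returns afterwards, so a
    chunk boundary that falls between "\r" and "\n" does no harm.
    """
    residual = ""
    while True:
        chunk = stream.read(chunk_size)
        if chunk == "":          # nothing left to read
            break
        parts = chunk.split("\n")
        # Prepend the incomplete line left over from the previous chunk.
        parts[0] = residual + parts[0]
        # The last piece may be incomplete; hold it back for the next read.
        residual = parts.pop()
        for line in parts:
            yield line.replace("\r", "")
    # A file that does not end with a line ending leaves a final partial line.
    if residual != "":
        yield residual.replace("\r", "")


if __name__ == "__main__":
    # newline="" keeps the "\r\n" sequences untranslated in the test data.
    sample = io.StringIO("alpha\r\nbeta\r\ngamma", newline="")
    # A chunk size of 6 deliberately breaks the first "\r\n" in half.
    for line in read_lines_lf(sample, chunk_size=6):
        print(line)
```

Because the carriage return is removed from each line after the split, it does not matter which side of the chunk boundary it lands on.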
I’ll look forward to seeing whether someone else can help. Here is the new framework version:
[code]Public Sub ReadChunks(txtIn As Xojo.IO.TextInputStream)
  Const chunkSize = 5000
  Dim residual As Text = ""
  Dim a As Integer = 1
  While Not txtIn.EOF
    Dim tArray() As Text = txtIn.Read(chunkSize).Split(&uA) ' IO Error
    ' insert residual text of incomplete line from previous read
    tArray(0) = residual + tArray(0)
    Dim nLines As Integer = tArray.Ubound - 1
    ' save incomplete line at end of current chunk for next read
    residual = tArray(nLines + 1)
    For i As Integer = 0 To nLines
      ' Process complete lines of text here
      a = a + 1
    Next
  Wend
End Sub[/code]
Do you know for sure that the exception is occurring on this line?
Dim tArray() As Text = txtIn.Read(chunkSize).Split(&uA) ' IO Error
Also, note that I edited my original code to account for the situation where the file doesn’t have an EndOfLine as the last character: I added the part after the Wend statement, but that shouldn’t affect the error that you’re getting.
Well, that would seem to confirm that to be the bad line. I wonder if it’s somehow trying to read past the end of file, and that’s throwing the exception. Do your diagnostics show how many file reads occurred before the exception?
When you said that my code worked, does that mean that you tested initially using the old framework?
BTW, the following revision should eliminate the Windows two-character line ending issue and allow you to use the proper EndOfLine rather than &uA, making it more reliable across different platforms. This is still the old framework, though.
[code]Public Sub ReadChunks(txtIn as TextInputStream)
  Const chunkSize=5000
  dim residual As String = ""
  While not txtIn.EOF
    'residual string is prefixed to front of txtIn.Read
    'which restores the possibly split Windows EndOfLine
    dim tArray() As string = split(residual+txtIn.Read(chunkSize),EndOfLine)
    dim nLines As Integer = UBound(tArray)-1
    residual=tArray(nLines+1) 'save incomplete line at end of current chunk for next read
    for i As integer = 0 to nLines
      'Process complete lines of text here
    Next
  Wend
  'If file doesn't end with an EndOfLine, then the residual string
  'will contain the final partial line which needs to be processed.
  if residual<>"" then
    'Process final incomplete line of text here
  end if
End Sub[/code]
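To see why prefixing the residual before the split matters, here is a small Python sketch of the same idea. It is not Xojo and makes a few assumptions for illustration: the line ending is fixed to "\r\n" in the constant `EOL`, the function name `read_lines_crlf` is made up, and `io.StringIO` stands in for the input stream.

```python
import io

EOL = "\r\n"  # assumption: Windows-style two-character line ending

def read_lines_crlf(stream, chunk_size=5000):
    """Yield lines from a text stream read in fixed-size chunks.

    The residual from the previous read is joined to the new chunk
    *before* splitting, so a line ending that was cut in half by the
    chunk boundary is whole again when the split runs.
    """
    residual = ""
    while True:
        chunk = stream.read(chunk_size)
        if chunk == "":          # nothing left to read
            break
        parts = (residual + chunk).split(EOL)
        # The last piece may be an incomplete line (possibly ending in a
        # lone "\r"); hold it back for the next read.
        residual = parts.pop()
        yield from parts
    # A file that does not end with a line ending leaves a final partial line.
    if residual != "":
        yield residual


if __name__ == "__main__":
    # newline="" keeps the "\r\n" sequences untranslated in the test data.
    sample = io.StringIO("alpha\r\nbeta\r\ngamma", newline="")
    # With chunk_size=6 the first read ends in "\r" and the second
    # starts with "\n"; the lines still come out clean.
    for line in read_lines_crlf(sample, chunk_size=6):
        print(line)
```

If the residual were instead added to the first element after splitting, the stray "\r" and "\n" would end up inside the line text rather than being recognised as a delimiter.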
After analyzing, it looks like the error appears during the last pass through the loop. Weird!
Yes, I do.
I think you can optimize it a little bit by using ReDim for the arrays.
[code]Public Sub ReadChunks(txtIn as TextInputStream)
  Const chunkSize=5000
  dim residual As String = ""
  dim tArray() As String
  While not txtIn.EOF
    'residual string is prefixed to front of txtIn.Read
    'which restores the possibly split Windows EndOfLine
    tArray = split(residual+txtIn.Read(chunkSize),EndOfLine)
    dim nLines As Integer = UBound(tArray)-1
    residual=tArray(nLines+1) 'save incomplete line at end of current chunk for next read
    for i As integer = 0 to nLines
      'Process complete lines of text here
    Next
  Wend
  'If file doesn't end with an EndOfLine, then the residual string
  'will contain the final partial line which needs to be processed.
  if residual<>"" then
    'Process final incomplete line of text here
  end if
  ReDim tArray(-1)
End Sub[/code]
Looks like it’s trying to read past the end of file then. This shouldn’t throw an exception though. It should just return zero characters. Maybe a Xojo bug then.
Checking Xojo.IO.TextInputStream in the Language reference, it gives two possible causes of an exception:
IOException - If there is not enough memory available or the stream is not open.
Obviously, the stream must be open or it wouldn’t get to the last loop.
I wonder if repeatedly assigning the new input to tArray is somehow not allowing the program to free up the memory used by the array. Maybe it would be a good idea to dimension the array at the beginning of the code, and then put a redim immediately before the input read. Like this:
[code]Public Sub ReadChunks(txtIn as TextInputStream)
  Const chunkSize=5000
  dim tArray() As string
  dim residual As String = ""
  While not txtIn.EOF
    'residual string is prefixed to front of txtIn.Read
    'which restores the possibly split Windows EndOfLine
    redim tArray(-1)
    tArray = split(residual+txtIn.Read(chunkSize),EndOfLine)
    dim nLines As Integer = UBound(tArray)-1
    residual=tArray(nLines+1) 'save incomplete line at end of current chunk for next read
    for i As integer = 0 to nLines
      'Process complete lines of text here
    Next
  Wend
  'If file doesn't end with an EndOfLine, then the residual string
  'will contain the final partial line which needs to be processed.
  if residual<>"" then
    'Process final incomplete line of text here
  end if
End Sub[/code]
Still the same error, but Robert, I’m really amazed by the speed of your algorithm. Wow, I was never able to read a 665.7 MB file before. OK, at this point I haven’t parsed anything yet, but when I read the whole file into a TextInputStream and split it into a lines array, my system froze completely and I had to do a hard restart of my Mac. Looks like chunks are a good thing.
Hello Markus,
You’re right. However, in my experience, splitting the entire TextInputStream into an array of lines put a very heavy load on the memory and CPU of my old 2009 MacBook Pro, and I didn’t have enough resources left to parse each line, because parsing creates its own large object structure. This resulted in the operating system freezing completely, which could only be fixed by manually switching the MacBook off and on again. Not nice.
You mean for the new Xojo framework translation? I tried it with a small chunk size of 50, and every time I got the error message for the line tArray(0) = txtIn.Read(chunkSize).Split(&uA).