I’m trying to understand how chunks work, and I’m finding it hard.
So far, I read text files via TextInputStream and put their lines into a string array. Then I loop through the array and parse each line. This works wonderfully as long as the text files are smaller than 150 MB; with larger files it gets difficult. The program’s resource usage then rises considerably (RAM utilization, etc.) and performance suffers. I also have files that are 669 MB (millions of lines) in size. I can’t read them this way, because RAM usage quickly rises to 3-4 GB. I have read about chunks several times, and that using MemoryBlocks is much better and faster. @Eugene Dakin also writes this in his book “I Wish I Knew How To… Use MemoryBlocks with Xojo”. I’m not a professional.
My files are structured roughly like this:
Objects are recognizable by a line that begins with a 0 (zero). The properties of an object follow as subordinate lines, each starting with a 1 (one). Each object can contain any number of properties. That’s why dividing the file into chunks should be difficult, right? The split could fall in the middle of an object.
0 My First Object
1 Height=200
1 Width=400
1 Type=4
0 My Second Object
1 Type=2
... etc.
How do I use chunks and MemoryBlocks, and how do I design the program so that it performs well? I am using the new Xojo framework.
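[code]
const chunkSize = 2097152 // 2 MB per read
Dim binaryTextData as new MemoryBlock( theFile.Length )
Dim position as Integer
Dim chunk as String
Dim ts as TextInputStream = TextInputStream.Open( theFile )
while not ts.EOF
  chunk = ts.Read( chunkSize )
  // the last chunk is usually shorter than chunkSize, so use its real byte length
  binaryTextData.StringValue( position, chunk.LenB ) = chunk
  position = position + chunk.LenB
wend
ts.Close
// note: this still holds the whole file in memory
return binaryTextData.StringValue( 0, binaryTextData.Size )
[/code]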
When using chunks, how could I show the parsing progress (line by line) in a ProgressBar?
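One way to drive a progress bar is to compare the number of bytes consumed so far with the file size. The following is only a sketch under assumptions: `f` is the FolderItem being read, `ProgressBar1` is a ProgressBar on the window, and `ParseLine` is a hypothetical method standing in for the existing parsing code. The byte count is approximate because the length of each line ending is guessed as one byte.

```xojo
// Sketch only: f, ProgressBar1 and ParseLine are assumptions.
Dim t As TextInputStream = TextInputStream.Open(f)
Dim total As Double = f.Length
Dim bytesDone As Double
Dim lineCount As Integer

ProgressBar1.Maximum = 100
While Not t.EOF
  Dim line As String = t.ReadLine
  ParseLine(line) // hypothetical: parse one line
  bytesDone = bytesDone + line.LenB + 1 // +1 approximates the line ending
  lineCount = lineCount + 1
  // update the bar only every 10000 lines to keep the overhead low
  If total > 0 And lineCount Mod 10000 = 0 Then
    ProgressBar1.Value = bytesDone / total * 100
  End If
Wend
ProgressBar1.Value = 100
t.Close
```

In a tight loop on the main thread the bar may not repaint until the loop ends, so in practice the parsing would probably have to run in a Thread, with a Timer copying the percentage to the ProgressBar.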
What happens with the data read from the file? If you want to keep it in memory completely in one array, you have not solved the memory problem. Does the data go to a database? Or another text file? Or…?
You read a part of the file, process it, release the memory used, read the next part, process that part, release the memory used, etc.
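A minimal sketch of this read, process, release cycle, assuming the classic TextInputStream API and the 0/1 line format described above. `ProcessObject` is a hypothetical method that stands for whatever is done with one complete object (parsing, writing to the database); it is not part of Xojo.

```xojo
// Sketch only: ProcessObject(lines() As String) is a hypothetical method.
Sub ImportFile(f As FolderItem)
  Dim t As TextInputStream = TextInputStream.Open(f)
  t.Encoding = Encodings.UTF8

  Dim objLines() As String // lines of the object currently being collected
  While Not t.EOF
    Dim line As String = t.ReadLine
    If Left(line, 1) = "0" And objLines.Ubound >= 0 Then
      // a new object starts, so the previous one is complete
      ProcessObject(objLines)
      Redim objLines(-1) // release the lines of the finished object
    End If
    objLines.Append(line)
  Wend

  // the last object is not followed by another "0" line
  If objLines.Ubound >= 0 Then ProcessObject(objLines)
  t.Close
End Sub
```

Only the lines of one object are held at any time, so memory use no longer depends on the file size. Whether ReadLine is fast enough for files of several hundred MB would need to be measured.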
That’s what I do. After parsing line by line, I translate the line data into an object structure (class) and write the objects and properties to an SQLite database. But I collect the SQL commands in an array and execute them only once, after the whole file has been read. I know this SQL command array will also become very big for a text file with millions of lines. But it’s better to execute one big SQL command than one after each object.
[quote=358913:@Eli Ott]It is. Your problem is, that you run out of memory or because of high memory usage your app gets slow.
You read a part of the file, insert into database, release the memory used, read the next part, insert into database, release the memory used, etc.[/quote]
OK, you’re right. Changing the database part isn’t difficult for me. But reading the text file in chunks…
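For the database side, a middle ground between one statement per object and one huge command array is to commit in batches inside a transaction. This is a hedged sketch: the table `objects` with a `name` column and the batch size are made up for illustration, `db` is assumed to be an already connected SQLiteDatabase, and error checking is left out.

```xojo
// Sketch only: table "objects(name)" and kBatchSize are assumptions.
Sub ImportObjects(db As SQLiteDatabase, t As TextInputStream)
  Const kBatchSize = 5000
  Dim pending As Integer

  Dim ps As PreparedSQLStatement = db.Prepare("INSERT INTO objects (name) VALUES (?)")
  ps.BindType(0, SQLitePreparedStatement.SQLITE_TEXT)

  db.SQLExecute("BEGIN TRANSACTION")
  While Not t.EOF
    Dim line As String = t.ReadLine
    If Left(line, 1) = "0" Then
      ps.SQLExecute(Mid(line, 3)) // the text after "0 "
      pending = pending + 1
      If pending >= kBatchSize Then
        db.Commit // write this batch to disk
        db.SQLExecute("BEGIN TRANSACTION")
        pending = 0
      End If
    End If
  Wend
  db.Commit // write the final, partial batch
End Sub
```

This keeps the speed benefit of grouping many inserts into one transaction while the memory held for pending SQL stays bounded by the batch size.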
I would recommend reading 1MB at a time from the file, process as many records as you can from that chunk, then copy the remainder to another memoryblock and append another 1MB from the file. Repeat as necessary.
What if the chunk splits within an object, between two properties? I think a chunk has to start and end after the last property of an object.
[quote=358926:@Tim Hare]I would recommend reading 1MB at a time from the file, process as many records as you can from that chunk, then copy the remainder to another memoryblock and append another 1MB from the file. Repeat as necessary.[/quote]
Thanks for your input, Tim. Could you please give an example? I’m a Xojo hobby developer; I don’t have any experience with MemoryBlocks.
I don’t understand what you expect from using MemoryBlocks when your problem is that you run out of memory. Like Eli wrote, define a chunk as a number of lines of the input file.
You split the chunks into your “objects” somehow. The last one in a chunk will be incomplete. Store it in a variable and concatenate it with the rest of the object, which “arrives” at the beginning of the next chunk.
Really, it doesn’t have to be a memoryblock. I only used that as an example because you mentioned them in your post. You can just as easily use a string and split it into an array. Since memory is an issue, we’re going to trade memory for speed/complexity. Abstract your file reading routine into a function that returns a string containing a single object. That will make it easier to adapt into your existing routines.
[code]
function GetNextObject(infile as textinputstream) as String
  dim instring, theobject, remainder as string
  dim temp() as string
  dim n as integer
  dim found as boolean
  static lines() as string

  // find the next object
  for n = 1 to ubound(lines)
    if left(lines(n), 1) = "0" then
      found = true
      exit
    end if
  next

  if not found then
    // get the next chunk
    if infile.EOF then
      // nothing more, return what we have
      theobject = join(lines, EndOfLine)
      return theobject
    end if
    instring = join(lines, EndOfLine)
    instring = instring + infile.read(1000000)
    lines = split(instring, EndOfLine)
    // try it again
    found = false
    for n = 1 to ubound(lines)
      if left(lines(n), 1) = "0" then
        found = true
        exit
      end if
    next
    if not found then
      // nothing more in the file
      theobject = join(lines, EndOfLine)
      return theobject
    end if
  end if

  // here only if we found the next object
  for i as integer = 0 to n-1
    temp.append lines(0)
    lines.remove(0)
  next
  theobject = join(temp, EndOfLine)
  return theobject
end function
[/code]
Notes:
- This is forum code, untested, and may require debugging.
- STATIC allows the lines() array to survive between calls to the function.
- for n = 1 to ubound(lines) will simply skip the loop if ubound(lines) is less than 1.
- Instead of repeating the code that searches for the next object, this code could be made recursive, which would make it shorter as well as allow for objects that are longer than 1MB.
- It doesn’t matter if the chunk contains an incomplete line; the rest of that line will be appended before it gets processed.
if left(lines(n,1) ="0" then // wrong
if left(lines(n),1) ="0" then // right
If I now open my text file with a TextInputStream, then logically only the very first object is found. What should the loop for the whole file look like? My loop only matches the first Object:
[code]
Dim t As TextInputStream
Try
  t = TextInputStream.Open(f)
  t.Encoding = Encodings.UTF8
  While Not t.EOF
    TextArea1.Text = GetNextObject(t)
  Wend
Catch e As IOException
  MsgBox("Error accessing file.")
End Try
t.Close
[/code]
I’m just pointing this out, because sometimes it’s better to continue the discussion in the original thread rather trying to bring everyone up to speed with what has already been discussed.
I totally agree. And thanks to your help, I learned a lot in the other post, and your code works well for me, too. I can understand it.
However, Tim has made a suggestion in this thread that could read the objects in a slightly different way. That is why I have replied here.
One problem was that I left out a bit of cleanup, where after the last object is found, we need to clear the lines() array. Here is a debugged version of the method:
[code]
Function GetNextObject(infile as textinputstream) As String
  dim instring, theobject, remainder as string
  dim temp() as string
  dim n as integer
  dim found as boolean
  static lines() as string

  // find the next object
  for n = 1 to ubound(lines)
    if left(lines(n),1) = "0" then
      found = true
      exit
    end
  next

  if not found then
    // get the next chunk
    if infile.EOF then
      // nothing more, return what we have
      theobject = join(lines, EndOfLine)
      redim lines(-1)
      return theobject
    end
    instring = join(lines, EndOfLine)
    instring = instring + infile.read(1000000)
    lines = split(instring, EndOfLine)
    // try it again
    found = false
    for n = 1 to ubound(lines)
      if left(lines(n),1) = "0" then
        found = true
        exit
      end
    next
    if not found then
      // nothing more in the file
      theobject = join(lines, EndOfLine)
      redim lines(-1)
      return theobject
    end
  end

  // here only if we found the next object
  for i as integer = 0 to n-1
    temp.append lines(0)
    lines.remove(0)
  next
  theobject = join(temp, EndOfLine)
  return theobject
End Function
[/code]
You call it like this:
[code]
dim s as string
dim a() as string
dim t as TextInputStream = TextInputStream.Open(f)
s = GetNextObject(t)
while s <> ""
  // process the object somehow
  a = s.split(EndOfLine)
  listbox1.addrow(a(0))
  // get the next one
  s = GetNextObject(t)
wend
t.Close
[/code]
I could use your help one more time. Tim’s code has worked very well so far, but now I need a variant that works with a BinaryStream instead of TextInputStream. It is important that no TextEncoding is assigned, because I have to manipulate bytes later and therefore only need the object boundaries, i.e. the offset (start position) and the size of the object (both logically as integer values).
So what I need is a method that returns a Pair(start, length) as result:
Function GetNextObjects(bs As BinaryStream) As Pair()
...
How do I have to modify the code for BinaryStream? All three line-ending conventions (CR, LF and CRLF) have to be taken into account when reading the file.
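One possible shape for such a method, as an untested sketch rather than a drop-in answer: instead of splitting strings, it scans the raw bytes chunk by chunk and records the absolute offset of every "0" that sits at the start of a line. The name `FindObjects`, the chunk size, and the decision to scan the whole file in one call (rather than returning one batch per call) are all assumptions. The reported length of each object includes its trailing line ending.

```xojo
// Sketch only: returns Pair(start, length) in bytes for every object.
Function FindObjects(f As FolderItem) As Pair()
  Const kChunkSize = 1048576 // 1 MB per read
  Dim starts() As Int64
  Dim atLineStart As Boolean = True // the first byte of the file starts a line
  Dim offset As Int64 // absolute file position of the current chunk

  Dim bs As BinaryStream = BinaryStream.Open(f, False)
  Dim total As Int64 = bs.Length
  While Not bs.EOF
    Dim chunk As String = bs.Read(kChunkSize) // no encoding, raw bytes
    Dim n As Integer = chunk.LenB
    If n > 0 Then
      Dim mb As MemoryBlock = chunk
      For i As Integer = 0 To n - 1
        Dim b As Integer = mb.Byte(i)
        // 48 is the byte value of "0"
        If atLineStart And b = 48 Then starts.Append(offset + i)
        // CR (13) and LF (10) both end a line; in a CRLF pair the LF
        // simply keeps the flag set, so all three conventions work
        atLineStart = (b = 13 Or b = 10)
      Next
      offset = offset + n
    End If
  Wend
  bs.Close

  // an object runs from its start to the next start (or to the end of the file)
  Dim result() As Pair
  For k As Integer = 0 To starts.Ubound
    Dim nextStart As Int64 = total
    If k < starts.Ubound Then nextStart = starts(k + 1)
    result.Append(New Pair(starts(k), nextStart - starts(k)))
  Next
  Return result
End Function
```

Because the line-start flag and the running offset survive from one chunk to the next, it does not matter where a chunk boundary falls. A byte-by-byte loop over several hundred MB may be slow, and for files with millions of objects the result array itself becomes large, so a variant that returns one batch of pairs per call might be preferable.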
[code]Private Function GetNextObjects(t As TextInputStream) As String()
Dim start, position, length As Integer
Dim appended As Boolean
Static rest As String
Dim result() As String
Static eol As String = EndOfLine + "0"
Const kEmpty = ""
Dim buffer As String = rest + ConvertEncoding(t.Read(4096), Encodings.UTF8)
length = buffer.LenB
position = 2
If length > 0 Then
Do
appended = False
position = buffer.InStr(position, eol)
If position > 0 Then
result.Append(buffer.Mid(start, position - start))
start = position + 1
position = position + 2
appended = True
End If
Loop Until Not appended Or position = length
If start < length - 1 Then
rest = buffer.Mid(start)
Else
rest = kEmpty
End If
If t.EOF And rest <> kEmpty Then
result.Append(rest)
rest = kEmpty
End If
Return result