Chunks & MemoryBlocks - How to?

Martin_T · November 12, 2017, 3:53pm

I’m trying to understand how chunks work, and I’m finding it difficult.

So far, I read text files via TextInputStream and hand their contents (lines) over to a string array. Then I loop through the array and parse the contents of each line. This works wonderfully as long as the text files are smaller than 150 MB; with larger files it becomes difficult. The program’s resource usage (RAM, etc.) then increases considerably. I also have files that are 669 MB (millions of lines) in size. I can’t read them this way, because the RAM usage quickly rises to 3-4 GB. I have read about chunks several times, and that using MemoryBlocks is much better and faster. @Eugene Dakin also writes this in his book “I Wish I Knew How To… Use MemoryBlocks with Xojo”. I’m not a professional.

Schematically, the structure of my files is this:

Objects are recognizable by the fact that the line begins with a 0 (zero). The properties of an object follow as subordinate lines, each starting with a 1 (one). Each object can contain any number of properties. That’s why dividing the file into chunks should be difficult, right? A split could happen in the middle of an object.

0 My First Object
1 Height=200
1 Width=400
1 Type=4
0 My Second Object
1 Type=2
... etc.

How do I use chunks and MemoryBlocks, and how do I design the program so that it performs well? I am using the new Xojo framework.

@Sam Rowlands gave me this code:

Dim binaryTextData As New MemoryBlock(theFile.Length)
Dim position As UInt64
Const chunkSize = 2097152
Dim ts As TextInputStream = TextInputStream.Open(theFile)
While Not ts.EOF
  Dim chunk As String = ts.Read(chunkSize)
  // the last chunk is usually shorter than chunkSize
  binaryTextData.StringValue(position, chunk.LenB) = chunk
  position = position + chunk.LenB
Wend
ts.Close
Return binaryTextData.StringValue(0, binaryTextData.Size)

Using chunks, how could I show the parsing progress (per line) with a ProgressBar?

Eli_Ott · November 12, 2017, 4:27pm

What happens with the data read from the file? If you want to keep it in memory completely in one array, you have not solved the memory problem. Does the data go to a database? Or another text file? Or…?

You read a part of the file, process it, release the memory used, read the next part, process that part, release the memory used, etc.

Eli_Ott · November 12, 2017, 4:29pm

The fastest is to use the old framework (TextInputStream.Read or …ReadLine or …ReadAll).

Martin_T · November 12, 2017, 4:34pm

That’s what I do. After parsing line by line, I translate the line data into an object structure (class) and write the objects and properties to an SQLite database. But I collect the SQL commands in an array and execute them only once, after the whole file has been read. I know this SQL command array will also become very big for a text file with millions of lines. But it’s better to execute one big SQL command than one after each object.

Eli_Ott · November 12, 2017, 4:47pm

I don’t think this is better. I’d update the database in “chunks”, e.g. after every 1000 or 5000 records.
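The batching idea above could look roughly like this. This is only a sketch under assumptions: it uses the classic SQLiteDatabase API (SQLExecute, Commit, Rollback), and the names FlushBatch, pending and kBatchSize are made up for illustration.

```xojo
// Hypothetical helper: executes the queued statements inside one transaction,
// then empties the queue so its memory can be released.
Sub FlushBatch(db As SQLiteDatabase, pending() As String)
  If pending.Ubound < 0 Then Return // nothing queued

  db.SQLExecute("BEGIN TRANSACTION")
  For Each sql As String In pending
    db.SQLExecute(sql)
    If db.Error Then
      MsgBox("Database error: " + db.ErrorMessage)
      db.Rollback
      Return
    End If
  Next
  db.Commit
  ReDim pending(-1) // clear the queue
End Sub

// In the parsing loop (kBatchSize is an assumed constant, e.g. 1000):
//   pending.Append(sql)
//   If pending.Ubound + 1 >= kBatchSize Then FlushBatch(db, pending)
// After the loop, call FlushBatch(db, pending) once more for the leftover statements.
```

Wrapping each batch in a transaction keeps most of the speed of one big commit, while the array never holds more than one batch of statements.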

Martin_T · November 12, 2017, 4:59pm

That’s easy for me to implement. But it’s not the main topic.

Eli_Ott · November 12, 2017, 5:02pm

It is. Your problem is, that you run out of memory or because of high memory usage your app gets slow.

You read a part of the file, insert into database, release the memory used, read the next part, insert into database, release the memory used, etc.

Martin_T · November 12, 2017, 5:10pm

OK, you’re right. Changing the database part isn’t difficult for me. But reading the text file in chunks…

Eli_Ott · November 12, 2017, 5:19pm

TextInputStream.Read(count As Integer) // reads count of bytes from the current position

[code]Const chunkSize = 1000000 // bytes per read; adjust as needed
Dim tis As TextInputStream = TextInputStream.Open(file)

Do Until tis.EOF
Dim s As String = tis.Read(chunkSize)
// process to database
Loop

tis.Close()[/code]
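As for the progress bar question from the first post, a chunked read loop like this one can report progress by counting the bytes read so far against the file size. The following is a rough sketch, not tested code: it assumes f is the FolderItem being read, ProgressBar1 is a ProgressBar on the window, and ProcessChunk is a hypothetical method that parses one chunk.

```xojo
Const kChunkSize = 1000000 // bytes per read (assumed value)

Dim tis As TextInputStream = TextInputStream.Open(f)
Dim totalBytes As Int64 = f.Length
Dim bytesRead As Int64 = 0

ProgressBar1.Maximum = 100 // show progress as a percentage

Do Until tis.EOF
  Dim s As String = tis.Read(kChunkSize)
  bytesRead = bytesRead + s.LenB

  ProcessChunk(s) // hypothetical: parse the chunk, write to the database

  If totalBytes > 0 Then
    ProgressBar1.Value = bytesRead * 100 \ totalBytes
    ProgressBar1.Refresh // force a repaint while the loop is still running
  End If
Loop

tis.Close
```

This updates once per chunk rather than once per line, which is usually fine for display purposes. For a responsive UI it would probably be better to run the loop in a Thread and update the bar from a Timer.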

Tim_Hare · November 12, 2017, 6:46pm

I would recommend reading 1MB at a time from the file, process as many records as you can from that chunk, then copy the remainder to another memoryblock and append another 1MB from the file. Repeat as necessary.

Martin_T · November 12, 2017, 6:51pm

What if the chunk splits within an object? Between two properties? I think a chunk has to start at the beginning of an object and end after that object’s last property.

Thanks for your input, Tim. Could you please give an example? I’m a Xojo hobby developer; I don’t have any experience with MemoryBlocks.

Carsten_Belling · November 12, 2017, 6:56pm

I don’t understand what you expect from using MemoryBlocks when your problem is that you run out of memory. Like Eli wrote, define a chunk as a number of lines of the input file.

Eli_Ott · November 12, 2017, 7:25pm

You split the chunks into your “objects” somehow. The last object in a chunk will usually be incomplete. Store it in a variable and concatenate it with the rest of the object, which “arrives” at the beginning of the next chunk.

Tim_Hare · November 12, 2017, 7:27pm

Really, it doesn’t have to be a memoryblock. I only used that as an example because you mentioned them in your post. You can just as easily use a string and split it into an array. Since memory is an issue, we’re going to trade memory for speed/complexity. Abstract your file reading routine into a function that returns a string containing a single object. That will make it easier to adapt into your existing routines.

function GetNextObject(infile as textinputstream) as String
   dim instring, theobject, remainder as string
   dim temp() as string
   dim n as integer
   dim found as boolean
   static lines() as string

   // find the next object
   for n = 1 to ubound(lines)
       if left(lines(n,1) = "0" then
          found = true
          exit
      end
   next

   if not found then
      // get the next chunk
      if infile.EOF then
         // nothing more, return what we have
         theobject = join(lines, EndOfLine)
         return theobject
      end
      instring =  join(lines, EndOfLine)
      instring = instring + infile.read(1000000)
      lines = split(instring, EndOfLine)
      // try it again
      found = false
      for n = 1 to ubound(lines)
         if left(lines(n,1) = "0" then
            found = true
            exit
         end
      next
      if not found
         // nothing more in the file
         theobject = join(lines, EndOfLine)
         return theobject
      end
   end

   // here only if we found the next object
   for i as integer = 0 to n-1
      temp.append lines(0)
      lines.remove(0)
   next
   theobject = join(temp, EndOfLine)
   return theobject
end function

Notes:

this is forum code, untested, and may require debugging
STATIC allows the lines() array to survive between calls to the function
for n = 1 to ubound(lines) will simply skip the loop if ubound(lines) is less than 1
instead of repeating the code that searches for the next object, this code could be made recursive, which would make it shorter as well as allow for objects that are longer than 1MB.
it doesn’t matter if the chunk contains an incomplete line, the rest of that line will be appended before it gets processed

Martin_T · April 28, 2018, 11:20pm

Thank you @Tim Hare,

there were just two little typos in your code:

if left(lines(n,1) ="0" then // wrong
if left(lines(n),1) ="0" then // right

If I now open my text file with a TextInputStream, then only the very first object is found. What should the loop for the whole file look like? My loop only matches the first object:

Dim t As TextInputStream
Try
  t = TextInputStream.Open(f)
  t.Encoding = Encodings.UTF8
  While Not t.EOF
    TextArea1.Text = GetNextObject(t)
  Wend
Catch e As IOException
  MsgBox("Error accessing file.")
End Try
t.Close

Robert_Weaver · April 29, 2018, 12:43am

FYI, this (or a minor variation of it) was previously asked and discussed in this thread:
https://forum.xojo.com/45264-chunk-algorithm-question

I’m just pointing this out because sometimes it’s better to continue the discussion in the original thread rather than trying to bring everyone up to speed with what has already been discussed.

Martin_T · April 29, 2018, 12:57am

I totally agree. And thanks to your help, I learned a lot in the other thread, and your code works well for me, too. I can understand it.
However, Tim made a suggestion in this thread for reading the objects in a slightly different way. That is why I have replied here.

Tim_Hare · May 3, 2018, 9:08pm

One problem was that I left out a bit of cleanup, where after the last object is found, we need to clear the lines() array. Here is a debugged version of the method:

Function GetNextObject(infile as textinputstream) As String
  dim instring, theobject, remainder as string
  dim temp() as string
  dim n as integer
  dim found as boolean
  static lines() as string
  
  // find the next object
  for n = 1 to ubound(lines)
    if left(lines(n),1) = "0" then
      found = true
      exit
    end
  next
  
  if not found then
    // get the next chunk
    if infile.EOF then
      // nothing more, return what we have
      theobject = join(lines, EndOfLine)
      redim lines(-1)
      return theobject
    end
    instring =  join(lines, EndOfLine)
    instring = instring + infile.read(1000000)
    lines = split(instring, EndOfLine)
    // try it again
    found = false
    for n = 1 to ubound(lines)
      if left(lines(n),1) = "0" then
        found = true
        exit
      end
    next
    if not found then
      // nothing more in the file
      theobject = join(lines, EndOfLine)
      redim lines(-1)
      return theobject
    end
  end
  
  // here only if we found the next object
  for i as integer = 0 to n-1
    temp.append lines(0)
    lines.remove(0)
  next
  theobject = join(temp, EndOfLine)
  return theobject
End Function

You call it like this

  dim s as string
  dim a() as string
  dim t as TextInputStream
  
  t= TextInputStream.open(f)
  
  s= GetNextObject(t)
  while s <> ""
    // process the object somehow
    a= s.split(EndOfLine)
    listbox1.addrow(a(0))
    // get the next one
    s= GetNextObject(t)
  wend
  t.Close

Martin_T · May 3, 2018, 10:15pm

Thank you, @Tim Hare . The code is easy to understand.

Martin_T · June 25, 2020, 1:28pm

Hello Tim and everyone else!

I could use your help one more time. Tim’s code has worked very well so far, but now I need a variant that works with a BinaryStream instead of TextInputStream. It is important that no TextEncoding is assigned, because I have to manipulate bytes later and therefore only need the object boundaries, i.e. the offset (start position) and the size of the object (both logically as integer values).

So what I need is a method that returns a Pair(start, length) as result:

Function GetNextObjects(bs As BinaryStream) As Pair() ...

How do I have to modify the code for BinaryStream? All three EndOfLine conventions (CR, LF and CRLF) have to be taken into account when reading the file.

[code]Private Function GetNextObjects(t As TextInputStream) As String()
Dim start, position, length As Integer
Dim appended As Boolean
Static rest As String
Dim result() As String
Static eol As String = EndOfLine + "0"
Const kEmpty = ""
Dim buffer As String = rest + ConvertEncoding(t.Read(4096), Encodings.UTF8)

length = buffer.LenB
position = 2

If length > 0 Then

Do
appended = False
position = buffer.InStr(position, eol)

  If position > 0 Then
    result.Append(buffer.Mid(start, position - start))
    start = position + 1
    position = position + 2
    appended = True
  End If

Loop Until Not appended Or position = length

If start < length - 1 Then
  rest = buffer.Mid(start)
Else
  rest = kEmpty
End If

If t.EOF And rest <> kEmpty Then
  result.Append(rest)
  rest = kEmpty
End If

Return result

Else
rest = kEmpty
End If
End Function[/code]
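One possible direction for the BinaryStream variant is to drop string handling entirely and scan the raw bytes. The sketch below is untested forum-style code under several assumptions: FindObjectBounds is a made-up name, it scans the whole stream in one call instead of being called repeatedly, an object starts wherever a line begins with the byte for "0" (&h30), and each reported length includes the object's trailing line ending. It uses the classic API (BinaryStream.Read, MemoryBlock.Byte).

```xojo
// Returns one Pair(offset : length) per object, both as byte counts.
Function FindObjectBounds(bs As BinaryStream) As Pair()
  Const kChunkSize = 1048576 // 1 MB per read (assumed value)

  Dim result() As Pair
  Dim objStart As Integer = 0          // offset where the current object begins
  Dim filePos As Integer = 0           // absolute offset of the byte being examined
  Dim atLineStart As Boolean = True    // the first byte of the file starts a line
  Dim haveObject As Boolean = False

  bs.Position = 0
  While Not bs.EOF
    Dim chunk As String = bs.Read(kChunkSize) // raw bytes, no encoding assigned
    If chunk.LenB = 0 Then Exit
    Dim mb As MemoryBlock = chunk

    For i As Integer = 0 To mb.Size - 1
      Dim b As Integer = mb.Byte(i)
      If atLineStart And b = &h30 Then
        // a new object begins here, so the previous one ends just before it
        If haveObject Then result.Append(New Pair(objStart, filePos - objStart))
        objStart = filePos
        haveObject = True
      End If
      // CR (13) and LF (10) both mark a line end; for CRLF the LF simply
      // keeps the flag set, so all three conventions are covered
      atLineStart = (b = 10 Or b = 13)
      filePos = filePos + 1
    Next
  Wend

  // the last object runs to the end of the file
  If haveObject Then result.Append(New Pair(objStart, filePos - objStart))
  Return result
End Function
```

It might be called as `Dim bounds() As Pair = FindObjectBounds(BinaryStream.Open(f, False))`. Because the state (line start flag, current offset) is carried across chunk boundaries, it does not matter where a chunk splits an object or a line ending. Note that a byte order mark at the start of the file would hide the first object's leading "0" in this sketch.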