saving a dictionary

For fun, here’s a version using Ptrs with 3 MemoryBlocks, which benchmarks about 80% slower than the Split version…

Your version using only 1 MemoryBlock is faster than this one:

  #pragma BackgroundTasks False
  #pragma BoundsChecking False
  
  dim t0 as double
  dim theseLines() as string
  dim fld as string
  for i as integer = 1 to 100000
    fld = format( i, "000000" )
    theseLines.Append fld + Chr( 9 ) + fld
  next
  dim t as string = join( theseLines, EndOfLine )
  
  dim msg as string
  dim txt as string = t
  txt = ReplaceLineEndings( txt, EndOfLine )
  
  
  t0 = Microseconds  // start the timer here, so we are only benchmarking the conversion code, not the setup code
  
  dim lines() as string = txt.Split( EndOfLine )
  dim d as new Dictionary
  for each l as string in lines
    dim flds() as string = l.Split( Chr( 9 ) )
    if flds.Ubound = 1 then
      d.Value( flds( 0 ) ) = flds( 1 )
    end if
  next
  
  msg = format( (microseconds - t0)/1000, "#," ) + " msec" + EndOfLine
  msg = msg + "Dictionary Count: " + str( d.Count )+ EndOfLine
  
  dim enc as TextEncoding = Encodings.UTF8
  t = txt.ConvertEncoding( enc )
  t = ReplaceLineEndings( t, Chr( 13 ) )
  
  
  dim m as MemoryBlock = t  // this causes a String to MemoryBlock conversion, which can be slow
  
  t0 = Microseconds  // start the timer here, so we are only benchmarking the conversion code, not the setup code
  
  dim p as Ptr = m
  
  d = new Dictionary
  dim lastPos as integer = m.Size - 1
  dim thisPos as integer
  
  dim mKey, mVal as MemoryBlock
  mKey = new MemoryBlock(1024)
  mVal = new MemoryBlock(1024)
  dim pKey as Ptr = mKey
  dim pVal as Ptr = mVal
  dim kIndx, vIndx as integer ' index into key,value MemoryBlock
  dim c as integer 
  dim readingValue as boolean = false
  
  while thisPos <= lastPos
    c = p.byte(thisPos)
    if readingValue then
      ' we are reading the value bytes
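      if c = 13 or thisPos >= lastPos then
        ' we are done - we got the key and the value, save it and move on
        if c <> 13 then
          ' the data does not end with a line ending, so keep the final byte of the last value
          pVal.Byte(vIndx) = c
          vIndx = vIndx + 1
        end if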
        readingValue = false
        d.value(mKey.StringValue(0,kIndx)) = mVal.StringValue(0,vIndx)
        kIndx = 0
        vIndx = 0
      else
        pVal.Byte(vIndx)=c
        vIndx = vIndx+1
      end if
    else
      ' we are reading the key bytes
      if c = 9 then
        ' we are done reading the key, now switch to value reading mode
        readingValue = true
      else
        pKey.byte(kIndx)=c
        kIndx = kIndx +1
      end if
    end if
    thisPos = thisPos + 1
  wend
  
  
  msg = msg + format( (microseconds-t0) /1000, "#," ) + " msecs" + EndOfLine
  msg = msg + "Dictionary Count: " + str( d.Count )
  MsgBox msg

Just want to make sure: You’re compiling for the tests, right? Or are these results within the IDE?

No, I was testing in the IDE, which is stupid.

Testing a compiled app:

  • Split: 269 msec
  • Ptr (with 3 MemoryBlocks): 269 msec

A tie at best.

Note: I did not include the timing of the String to MemoryBlock conversion. Whether to include it depends on whether the source data would come from a MemoryBlock or a String. If your data would always be in String format, it’s arguably cheating to leave this out.

Given the original requirement of loading a Dictionary from tab-delimited text, I think you have to include it.
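
To see how much the preparation steps add, they can be timed on their own. This is only a rough sketch that reuses the test data from the benchmark above; the variable name prepMsec is made up for illustration, and the number it reports will vary by machine and should be measured in a compiled app:

  // Build the same 100,000-line test string as in the benchmark above
  dim theseLines() as string
  for i as integer = 1 to 100000
    dim fld as string = format( i, "000000" )
    theseLines.Append fld + Chr( 9 ) + fld
  next
  dim txt as string = join( theseLines, EndOfLine )
  
  // Time only the steps the Ptr version needs before it can start parsing
  dim t0 as double = Microseconds
  dim t as string = txt.ConvertEncoding( Encodings.UTF8 )
  t = ReplaceLineEndings( t, Chr( 13 ) )
  dim m as MemoryBlock = t  // String to MemoryBlock conversion
  dim prepMsec as double = ( Microseconds - t0 ) / 1000
  
  MsgBox "Preparation: " + format( prepMsec, "#,##0.0" ) + " msec"

Adding this figure to the Ptr timing gives a fairer comparison when the source data always arrives as a String.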

BTW, if I take out the timing for encoding conversion and ReplaceLineEndings, my MemoryBlock code is significantly faster than Split. Using your parameter of 100,000 lines, it times at 315 ms for the MemoryBlock vs. 390 ms for Split, and Split vs. SplitB doesn’t really make a difference (surprisingly).

If I put back the code that defines the encoding of both the key and value as UTF8, the MemoryBlock version still wins at about 361 ms.

Conclusion: Yes, you can make something faster than Split for a savings of about 8%, but it’s hardly worth it for the extra code. Agree?

Depending on what you’re storing in the dictionary, you might try my dictionary subclass that saves a dictionary out to XML and lets you read it back again. Out of the box it only supports a few datatypes, but it’s extensible so you can add other datatypes. I wrote about it in issue 8.2 of RSD. I use it for exactly the type of thing you’re doing: loading everything into a dictionary initially, and saving it so it doesn’t have to be recreated from scratch the next time. You can download the source code for free at the above link, though to read the article requires purchase of the issue.

What about a binary format? Simply use a BinaryStream: write a UInt32 indicating the length of the chunk, then the chunk itself.
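
A minimal sketch of that length-prefixed idea, assuming every key and value is a String with a defined encoding. The method names SaveDictionary and LoadDictionary and the leading entry count are my own choices for illustration, not something from the posts above, and there is no error handling (a Nil stream or a truncated file is not checked):

  Sub SaveDictionary(d As Dictionary, f As FolderItem)
    // Hypothetical helper: each chunk is a UInt32 byte length followed by UTF-8 bytes
    Dim bs As BinaryStream = BinaryStream.Create(f, True)
    bs.LittleEndian = True  // fix the byte order so the file reads the same everywhere
    bs.WriteUInt32(d.Count)  // number of key/value pairs that follow
    For Each k As Variant In d.Keys
      Dim key As String = ConvertEncoding(k.StringValue, Encodings.UTF8)
      Dim value As String = ConvertEncoding(d.Value(k).StringValue, Encodings.UTF8)
      bs.WriteUInt32(key.LenB)
      bs.Write(key)
      bs.WriteUInt32(value.LenB)
      bs.Write(value)
    Next
    bs.Close
  End Sub
  
  Function LoadDictionary(f As FolderItem) As Dictionary
    // Hypothetical helper: reads back the file written by SaveDictionary
    Dim d As New Dictionary
    Dim bs As BinaryStream = BinaryStream.Open(f, False)
    bs.LittleEndian = True
    Dim count As Integer = bs.ReadUInt32
    For i As Integer = 1 To count
      Dim keyLen As Integer = bs.ReadUInt32
      Dim key As String = bs.Read(keyLen, Encodings.UTF8)
      Dim valueLen As Integer = bs.ReadUInt32
      Dim value As String = bs.Read(valueLen, Encodings.UTF8)
      d.Value(key) = value
    Next
    bs.Close
    Return d
  End Function

Because the lengths are stored up front, loading needs no Split and no scanning for delimiters, and keys or values may contain tabs or line endings. Whether it actually beats the text versions above would have to be benchmarked in a compiled app.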