saving a dictionary

I am working with a dictionary that takes 5-10 minutes to create each time I run my program. I would like to be able to save this dictionary and read it in rather that creating it from scratch each time the program is run. Is there a way to save the dictionary so that it can easily and efficiently be read in when I run the program again?

Thanks.

I have to ask, what the heck would take 5-10 minutes to create?

The answer to your question depends on what’s in the Dictionary, and which platform you’re targeting. If it’s all human-readable stuff (for example, strings and not classes), then your best bet would be to transfer it to and from a Plist. The MacOSLib project contains tools to let you do that easily.

I’m also curious…

Hi Kem and Eric -

The dictionary actually contains a list of gene names for yeast (key) and the names of corresponding genes in other species (tab delimited string). I have stored to source text file (generated from an Excel spreadsheet) within the application and parse it using nthfield which I suspect is the slow part. There are about 14,000 genes/rows in the data. Would it be quicker to store the data as two arrays - one for keys and one for values - and then use those to generate the dictionary?

What would be great is if I could just store the dictionary in the application once it is created. That way it would not have to be generated each time the application starts.

Bob

NthField is very slow when used repeatedly and should be avoided if you plan to call it more than once or twice. Try something like this:

dim txt as string = TextOfTheFile
txt = ReplaceLineEndings( txt, EndOfLine )
dim lines() as string = txt.Split( EndOfLine )
dim d as new Dictionary
for each l as string in lines
  dim flds() as string = l.Split( Chr( 9 ) )
  if flds.Ubound = 1 then
    d.Value( flds( 0 ) ) = flds( 1 )
  end if
next

Bet you’ll think it didn’t work because it’s so fast. :slight_smile:

Thanks, Kem. I’ll give it a try later (I have some meetings to go to now).

Bob

No problem, I timed it for you. Within the IDE, it takes about 50 ms to load the Dictionary from a text file of 14,000 lines where each line is a tab-delimited string and the first column is unique. Here is my test code:

  dim theseLines() as string
  dim fld as string
  for i as integer = 1 to 14000
    fld = format( i, "000000" )
    theseLines.Append fld + Chr( 9 ) + fld
  next
  dim t as string = join( theseLines, EndOfLine )
  
  dim msg as string
  dim sw as new Stopwatch_MTC
  sw.Start
  
  dim txt as string = t
  txt = ReplaceLineEndings( txt, EndOfLine )
  dim lines() as string = txt.Split( EndOfLine )
  dim d as new Dictionary
  for each l as string in lines
    dim flds() as string = l.Split( Chr( 9 ) )
    if flds.Ubound = 1 then
      d.Value( flds( 0 ) ) = flds( 1 )
    end if
  next
  
  sw.Stop
  msg = format( sw.ElapsedMicroseconds, "#," ) + " microsecs"
  AddToResult msg
  msg = "Dictionary Count: " + str( d.Count )
  AddToResult msg

It sounds to me like this is a place to use a DATABASE table with an appropriate index…

build it once… drag it into the IDE as a resource… have the application use BINARYSTREAM to write it to the users drive on first execution… and be done.

Kem, I just tested your code and it works perfectly. You were right… it was so fast that I was not sure it actually executed, but it did. I did not realize that nthfield was so slow. I’ll have to go back and look at some of my other code to see if it can be improved.

Thanks!

Bob

My pleasure.

Just for some background, NthField is comparatively slow because it always has to start from the beginning of the string each time. If your string has a few fields, this isn’t a big deal, but when you have 14,000 line, and it has to start from line 1 on every call, the time to execute would be the similar to the time it would take to count to 14,000 if you had to count like this:

1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5…

Split goes through the string only once and breaks it up as it goes, so it is much faster.

Execution time of NthField rises exponentially as you increase the field number.

For example, if you call NthField(MyString, " ", 1) to get the first word in the string, NthField starts at 1 and counts up to 1 (taking X milliseconds). NthField(MyString, " ", 2) takes X^2 milliseconds; NthField(MyString, " ", 14000) takes x^14000 milliseconds.

To answer your original question, there is no native way to save the contents of a Dictionary. You’ll have to write code that iterates through each Key and saves it to a file along with its associated Value.

Note that this will only be “easy” for Keys and Values that convert directly to and from strings. Objects do not, and thus to have this work with Object keys or values you will need to write code that handles each different class that you anticipate being stored in the Dictionary.

Andrew: Why do you say that this is an exponential process? If it takes x seconds to find the first field, shouldn’t it take 2x to find the second field and 3x to find the third field?

Eric: I guess it is no longer necessary to save the dictionary. All I need to do is store the text file in the application itself and then create the dictionary from the text file on the fly (since it is so fast).

[quote=41274:@Andrew Lambert]Execution time of NthField rises exponentially as you increase the field number.

For example, if you call NthField(MyString, " ", 1) to get the first word in the string, NthField starts at 1 and counts up to 1 (taking X milliseconds). NthField(MyString, " ", 2) takes X^2 milliseconds; NthField(MyString, " ", 14000) takes x^14000 milliseconds.[/quote]

Actually, I believe that it’s Order N^2 ( see http://en.wikipedia.org/wiki/Big_O_notation )

Other thoughts:

  • NthField is slower than NtheFieldB : the B version treats text as bytes, whereas the regular version handles unicode text (which may have multi-byte characters). If you are sure your text is plain ASCII, you could use the B version.
  • Split() is very fast, but it still has a lot of overhead as it’s creating arrays. If you need something faster, you will want to play around with MemoryBlocks and Pointers and scan through the data byte-by-byte.

You’re probably right: I’m still wrapping my head around big-O notation and the theory behind it (I also suck at math.)

FWIW, I have been using the Split() function and all the others related methods or functions almost everywhere in my code, and I’ve never had a speed issue. May be MemoryBlocks are faster, but I’m not sure the gain of speed is worth the code readability loss.

Because I was curious, and a little nuts, I decided to test the two methods, Split vs. a MemoryBlock. It turns out that the latter is faster by 2-3 ms, but only if you discard encodings. It hardly seems worth it in this case, but perhaps someone can see a way to improve it. My code for both is:

  #pragma BackgroundTasks False
  #pragma BoundsChecking False
  
  dim theseLines() as string
  dim fld as string
  for i as integer = 1 to 14000
    fld = format( i, "000000" )
    theseLines.Append fld + Chr( 9 ) + fld
  next
  dim t as string = join( theseLines, EndOfLine )
  
  dim msg as string
  dim sw as new Stopwatch_MTC
  sw.Start
  
  dim txt as string = t
  txt = ReplaceLineEndings( txt, EndOfLine )
  dim lines() as string = txt.Split( EndOfLine )
  dim d as new Dictionary
  for each l as string in lines
    dim flds() as string = l.Split( Chr( 9 ) )
    if flds.Ubound = 1 then
      d.Value( flds( 0 ) ) = flds( 1 )
    end if
  next
  
  sw.Stop
  msg = format( sw.ElapsedMicroseconds, "#," ) + " microsecs"
  AddToResult msg
  msg = "Dictionary Count: " + str( d.Count )
  AddToResult msg
  
  sw.Reset
  sw.Start
  
  dim enc as TextEncoding = Encodings.UTF8
  t = txt.ConvertEncoding( enc )
  t = ReplaceLineEndings( t, Chr( 13 ) )
  dim m as MemoryBlock = t
  dim p as Ptr = m
  
  d = new Dictionary
  dim keyPosStart, keyPosEnd as integer
  dim valPosStart, valPosEnd as integer
  dim mbPos as integer
  dim lastPos as integer = m.Size - 1
  dim thisPos as integer
  while keyPosStart < lastPos
    for thisPos = keyPosStart to lastPos
      if p.Byte( thisPos ) = 9 then
        keyPosEnd = thisPos
        valPosStart = thisPos + 1
        exit
      end if
    next
    
    valPosEnd = lastPos + 1
    for thisPos = valPosStart to lastPos
      if p.Byte( thisPos ) = 13 then
        valPosEnd = thisPos
        exit
      end if
    next
    
    d.Value( m.StringValue( keyPosStart, keyPosEnd - keyPosStart ) ) = m.StringValue( valPosStart, valPosEnd - valPosStart )
    keyPosStart = valPosEnd + 1
    
  wend
  
  sw.Stop
  msg = format( sw.ElapsedMicroseconds, "#," ) + " microsecs"
  AddToResult msg
  msg = "Dictionary Count: " + str( d.Count )
  AddToResult msg

Fun!

I made a few changes to your demo:

  • increased the N to 100000
  • start the timer AFTER the ReplacelineEndings call (since we aren’t really interested in the speed of that).

Results:

  • Split: 257
  • Ptr : 293

The split version is actually a little faster - and since the Ptr version is more complicated, the Split version is the winner.

However, I bet the Ptr version can be improved. I’ll see what I can do.

At best, ReplaceLineEndings is a wash as the timing of it should either be included in both versions or disregarded for each. But since it’s essential to either version, I left it in both.

I’m curious to see what you might come up with. Also, SplitB might be faster than Split if the text is converted to UTF8 beforehand. That conversion might take time too, of course.