I am working with a dictionary that takes 5-10 minutes to create each time I run my program. I would like to be able to save this dictionary and read it in rather that creating it from scratch each time the program is run. Is there a way to save the dictionary so that it can easily and efficiently be read in when I run the program again?
I have to ask, what the heck would take 5-10 minutes to create?
The answer to your question depends on what’s in the Dictionary, and which platform you’re targeting. If it’s all human-readable stuff (for example, strings and not classes), then your best bet would be to transfer it to and from a Plist. The MacOSLib project contains tools to let you do that easily.
The dictionary actually contains a list of gene names for yeast (key) and the names of corresponding genes in other species (tab delimited string). I have stored to source text file (generated from an Excel spreadsheet) within the application and parse it using nthfield which I suspect is the slow part. There are about 14,000 genes/rows in the data. Would it be quicker to store the data as two arrays - one for keys and one for values - and then use those to generate the dictionary?
What would be great is if I could just store the dictionary in the application once it is created. That way it would not have to be generated each time the application starts.
NthField is very slow when used repeatedly and should be avoided if you plan to call it more than once or twice. Try something like this:
dim txt as string = TextOfTheFile
txt = ReplaceLineEndings( txt, EndOfLine )
dim lines() as string = txt.Split( EndOfLine )
dim d as new Dictionary
for each l as string in lines
dim flds() as string = l.Split( Chr( 9 ) )
if flds.Ubound = 1 then
d.Value( flds( 0 ) ) = flds( 1 )
end if
next
Bet you’ll think it didn’t work because it’s so fast.
No problem, I timed it for you. Within the IDE, it takes about 50 ms to load the Dictionary from a text file of 14,000 lines where each line is a tab-delimited string and the first column is unique. Here is my test code:
dim theseLines() as string
dim fld as string
for i as integer = 1 to 14000
fld = format( i, "000000" )
theseLines.Append fld + Chr( 9 ) + fld
next
dim t as string = join( theseLines, EndOfLine )
dim msg as string
dim sw as new Stopwatch_MTC
sw.Start
dim txt as string = t
txt = ReplaceLineEndings( txt, EndOfLine )
dim lines() as string = txt.Split( EndOfLine )
dim d as new Dictionary
for each l as string in lines
dim flds() as string = l.Split( Chr( 9 ) )
if flds.Ubound = 1 then
d.Value( flds( 0 ) ) = flds( 1 )
end if
next
sw.Stop
msg = format( sw.ElapsedMicroseconds, "#," ) + " microsecs"
AddToResult msg
msg = "Dictionary Count: " + str( d.Count )
AddToResult msg
It sounds to me like this is a place to use a DATABASE table with an appropriate index…
build it once… drag it into the IDE as a resource… have the application use BINARYSTREAM to write it to the users drive on first execution… and be done.
Kem, I just tested your code and it works perfectly. You were right… it was so fast that I was not sure it actually executed, but it did. I did not realize that nthfield was so slow. I’ll have to go back and look at some of my other code to see if it can be improved.
Just for some background, NthField is comparatively slow because it always has to start from the beginning of the string each time. If your string has a few fields, this isn’t a big deal, but when you have 14,000 line, and it has to start from line 1 on every call, the time to execute would be the similar to the time it would take to count to 14,000 if you had to count like this:
1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5…
Split goes through the string only once and breaks it up as it goes, so it is much faster.
Execution time of NthField rises exponentially as you increase the field number.
For example, if you call NthField(MyString, " ", 1) to get the first word in the string, NthField starts at 1 and counts up to 1 (taking X milliseconds). NthField(MyString, " ", 2) takes X^2 milliseconds; NthField(MyString, " ", 14000) takes x^14000 milliseconds.
To answer your original question, there is no native way to save the contents of a Dictionary. You’ll have to write code that iterates through each Key and saves it to a file along with its associated Value.
Note that this will only be “easy” for Keys and Values that convert directly to and from strings. Objects do not, and thus to have this work with Object keys or values you will need to write code that handles each different class that you anticipate being stored in the Dictionary.
Andrew: Why do you say that this is an exponential process? If it takes x seconds to find the first field, shouldn’t it take 2x to find the second field and 3x to find the third field?
Eric: I guess it is no longer necessary to save the dictionary. All I need to do is store the text file in the application itself and then create the dictionary from the text file on the fly (since it is so fast).
[quote=41274:@Andrew Lambert]Execution time of NthField rises exponentially as you increase the field number.
For example, if you call NthField(MyString, " ", 1) to get the first word in the string, NthField starts at 1 and counts up to 1 (taking X milliseconds). NthField(MyString, " ", 2) takes X^2 milliseconds; NthField(MyString, " ", 14000) takes x^14000 milliseconds.[/quote]
NthField is slower than NtheFieldB : the B version treats text as bytes, whereas the regular version handles unicode text (which may have multi-byte characters). If you are sure your text is plain ASCII, you could use the B version.
Split() is very fast, but it still has a lot of overhead as it’s creating arrays. If you need something faster, you will want to play around with MemoryBlocks and Pointers and scan through the data byte-by-byte.
FWIW, I have been using the Split() function and all the others related methods or functions almost everywhere in my code, and I’ve never had a speed issue. May be MemoryBlocks are faster, but I’m not sure the gain of speed is worth the code readability loss.
Because I was curious, and a little nuts, I decided to test the two methods, Split vs. a MemoryBlock. It turns out that the latter is faster by 2-3 ms, but only if you discard encodings. It hardly seems worth it in this case, but perhaps someone can see a way to improve it. My code for both is:
#pragma BackgroundTasks False
#pragma BoundsChecking False
dim theseLines() as string
dim fld as string
for i as integer = 1 to 14000
fld = format( i, "000000" )
theseLines.Append fld + Chr( 9 ) + fld
next
dim t as string = join( theseLines, EndOfLine )
dim msg as string
dim sw as new Stopwatch_MTC
sw.Start
dim txt as string = t
txt = ReplaceLineEndings( txt, EndOfLine )
dim lines() as string = txt.Split( EndOfLine )
dim d as new Dictionary
for each l as string in lines
dim flds() as string = l.Split( Chr( 9 ) )
if flds.Ubound = 1 then
d.Value( flds( 0 ) ) = flds( 1 )
end if
next
sw.Stop
msg = format( sw.ElapsedMicroseconds, "#," ) + " microsecs"
AddToResult msg
msg = "Dictionary Count: " + str( d.Count )
AddToResult msg
sw.Reset
sw.Start
dim enc as TextEncoding = Encodings.UTF8
t = txt.ConvertEncoding( enc )
t = ReplaceLineEndings( t, Chr( 13 ) )
dim m as MemoryBlock = t
dim p as Ptr = m
d = new Dictionary
dim keyPosStart, keyPosEnd as integer
dim valPosStart, valPosEnd as integer
dim mbPos as integer
dim lastPos as integer = m.Size - 1
dim thisPos as integer
while keyPosStart < lastPos
for thisPos = keyPosStart to lastPos
if p.Byte( thisPos ) = 9 then
keyPosEnd = thisPos
valPosStart = thisPos + 1
exit
end if
next
valPosEnd = lastPos + 1
for thisPos = valPosStart to lastPos
if p.Byte( thisPos ) = 13 then
valPosEnd = thisPos
exit
end if
next
d.Value( m.StringValue( keyPosStart, keyPosEnd - keyPosStart ) ) = m.StringValue( valPosStart, valPosEnd - valPosStart )
keyPosStart = valPosEnd + 1
wend
sw.Stop
msg = format( sw.ElapsedMicroseconds, "#," ) + " microsecs"
AddToResult msg
msg = "Dictionary Count: " + str( d.Count )
AddToResult msg
At best, ReplaceLineEndings is a wash as the timing of it should either be included in both versions or disregarded for each. But since it’s essential to either version, I left it in both.
I’m curious to see what you might come up with. Also, SplitB might be faster than Split if the text is converted to UTF8 beforehand. That conversion might take time too, of course.