Dictionary Size Crashing? Combining Two Text Files

Charley_Collins · October 28, 2015, 12:30pm

I have two thesaurus text files that I am attempting to combine into a larger thesaurus file. Each file has a number of lines of text, with each line consisting of a list of words separated by commas. The first word is the main entry word one would look up in a thesaurus, while the remaining words on each line (which are listed in alphabetical order) are synonyms for the first word.

A typical line would be as follows - “sharp” is the main entry word - the remaining words are synonyms of “sharp”:

sharp,abrupt,acerb,acerbic,acuate,acute,astringent,astute,carnassial,crisp,cutting

This code has two main parts:

Initially open first file to count lines to display progress of process via a label.
Open first text file and add it to dictionary. This file has just over 30,000 lines of text.
Initially open second file to count lines to display progress of process via label.
Go through each line from the 2nd text file - it has about 142,000 lines and do one of the following:

A. If dictionary does not already have a matching main entry word from the line of text being read then add that line to the dictionary.

B. If dictionary already has a matching main entry word then go through each word on the line and only add those synonyms not already in the dictionary for that main entry word.

C. Save as text file.
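
For readers who want the intended logic without the Xojo details, steps A to C can be sketched as follows. This is a minimal Python sketch, not the Xojo code from this post; `parse_line` and `merge_lines` are hypothetical names, and it assumes every line has the form `entry,synonym,synonym,...`.

```python
from itertools import chain


def parse_line(line):
    """Split 'entry,syn1,syn2' into ('entry', ['syn1', 'syn2'])."""
    head, _, rest = line.strip().partition(",")
    return head, [w for w in rest.split(",") if w]


def merge_lines(first_lines, second_lines):
    """Merge two thesaurus listings keyed on the main entry word."""
    merged = {}
    for line in chain(first_lines, second_lines):
        if not line.strip():
            continue  # ignore blank lines
        head, synonyms = parse_line(line)
        # Step A: an unseen entry word starts with an empty set.
        # Step B: a known entry word only gains the synonyms it lacks.
        merged.setdefault(head, set()).update(synonyms)
    # Step C: one output line per entry word, synonyms in alphabetical order.
    return [",".join([head] + sorted(merged[head])) for head in sorted(merged)]
```

Keeping a set per entry word means a duplicate check is a hash lookup, not a substring search through a growing comma-separated string.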

I’ve used very similar code before; however, the application suddenly quits during the second process and never saves the resulting file. I suspect that the dictionary may be getting too large and that the app may be running out of memory; however, I do have 48GB installed, so I am not sure whether that is the problem. I have no idea whether there is a memory leak. The OS (10.9.5) displays a message window simply saying that the application quit unexpectedly, but it gives no details. This happens both in the IDE and in a compiled app.

  Dim F as FolderItem
  Dim TempIn, TextIn as TextInputStream
  Dim TextOut as TextOutputStream
  Dim CatLine, CatList, CatListWord, CatWord, DicList, TempArray(-1), TempList, TempText, TLine, TList, TWord, Unique as String
  Dim D2, DoIt, TempCount as Integer
  Dim D as New Dictionary

  F = GetFolderItem("").Child("mthesaur.txt")//Open Existing Thesaurus & Construct Dictionary
  
  TempIn = F.OpenAsTextFile
  
  If TempIn <> Nil then
    TempText = TempIn.ReadAll
    TempCount = CountFields(TempText, EndOfLine)
    Label1.Text = Str(TempCount)
    Label1.Refresh
    TempText = ""
    TempCount = 0
  End If
  
  TempIn.Close
  F = GetFolderItem("").Child("mthesaur.txt")
  
  If F.Exists and F <> Nil then
    
    TempIn = F.OpenAsTextFile
    TempIn.Encoding = Encodings.ASCII
    
    If TempIn <> Nil then
      
      Do
        
        TLine = TempIn.ReadLine
        TWord = NthField(TLine, ",", 1)
        TList = Mid(TLine, Instr(TLine, ",") + 1)
        D.Value(TWord) = TList
        Label1.Text = Str(Val(Label1.Text) - 1)
        Label1.Refresh
        
      Loop Until TempIn.EOF
      
    End If
    
  End If
  
  TempIn.Close
  
  
  F = GetFolderItem("").Child("MyThes-1.0").Child("parsedic")
  
  If F.Exists and F <> Nil then
    
    TempIn = F.OpenAsTextFile
    
    If TempIn <> Nil then
      TempText = TempIn.ReadAll
      TempCount = CountFields(TempText, EndOfLine)
      Label1.Text = Str(TempCount)
      Label1.Refresh
      TempIn.Close
    End If
    
    
    TextIn = F.OpenAsTextFile
    
    If TextIn <> Nil then
      
      TextIn.Encoding = Encodings.ASCII
      
      Do
        
        CatLine = TextIn.ReadLine
        Label1.Text = Str(Val(Label1.Text) - 1)
        Label1.Refresh
        
        CatWord = NthField(CatLine, ",", 1)//First Word on Line is Main Entry Word
        CatList = Mid(CatLine, Instr(CatLine, ",") + 1)//Remaining Words on Line After 1st Comma are Synonyms
        
        If D.HasKey(CatWord) then//Existing Main Word in Thesaurus
          
          TempList = D.Value(CatWord)
          
          For DoIt = 1 to CountFields(TempList, ",")
            CatListWord = NthField(TempList, ",", DoIt)
            If Instr(Unique, CatListWord + ",") = 0 then
              Unique = Unique + CatListWord + ","
            End If
          Next
          
          Unique = Left(Unique, Len(Unique) - 1)//Remove Trailing Comma
          
          TempArray = Split(Unique, ",")
          TempArray.Sort
          Unique = Join(TempArray, ",")
          
          D.Value(CatWord) = Unique//Reset Dictionary
          
        Else//No Existing Main Word in Thesaurus
          
          TempArray = Split(Unique, ",")
          TempArray.Sort
          Unique = Join(TempArray, ",")
          
          D.Value(CatWord) = Unique
          
        End If
        
      Loop Until TextIn.EOF
      
    Else
      
      'MsgBox "Could not open the file."
      
    End If
    
    
  Else
    
    'MsgBox "The file does not exist."
    
  End If
  
  TextIn.Close
  
  
  Label1.Text = Str(D.Count)
  Label1.Refresh
  
  For D2 = 0 to D.Count - 1
    DicList = DicList + D.Key(D2) + "," + D.Value(D.Key(D2)) + EndOfLine
    Label1.Text = Str(Val(Label1.Text) - 1)
    Label1.Refresh
  Next
  
  F = GetFolderItem("").Child("MyThes-1.0").Child("combineddic")
  
  If F <> Nil then
    
    TextOut = TextOutputStream.Create(F)
    TextOut.Write DicList
    TextOut.Close
    
  End If

The second part of this question is what pragma code should I add to this method to speed it up such as:

#pragma DisableBackgroundTasks
#pragma NilObjectChecking
#pragma StackOverflowChecking

And where should I add it in the method - at the top or just inside the main loops?

Note:

This is a utility app I am using to prepare data for another app so I’ve not used a thread & timer as I am not concerned with the app’s window being manipulated while the method is running.

Greg_O_Lone · October 28, 2015, 1:04pm

Two other ideas…

First (although it’ll be slow because of multiple passes): read file 1, line 1 into an array. Look for matches in file two. If there are items to add, add them to the array, sort the array, write it to file three, rinse, repeat.

Another way would be to use an in-memory SQLite database. It would be a single pass to get data in, but a little more complicated to set up.

Kem_Tekinay · October 28, 2015, 1:15pm

Or even an on-disk SQLite database.

Norman_P · October 28, 2015, 1:46pm

merge sort with disk files
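
One way to read this suggestion: if both files are first sorted by entry word, they can be merged in a single streaming pass that never holds more than one entry in memory. The following is a minimal Python sketch (not Xojo) under that assumption; `head_word` and `merge_sorted` are hypothetical names, and `io.StringIO` stands in for the two disk files.

```python
import heapq
import io
from itertools import groupby


def head_word(line):
    """Return the entry word, i.e. everything before the first comma."""
    return line.split(",", 1)[0]


def merge_sorted(stream_a, stream_b, out):
    """Merge two line streams that are each already sorted by entry word."""
    lines_a = (ln.strip() for ln in stream_a if ln.strip())
    lines_b = (ln.strip() for ln in stream_b if ln.strip())
    merged = heapq.merge(lines_a, lines_b, key=head_word)
    # Lines sharing an entry word arrive next to each other, so group them.
    for head, group in groupby(merged, key=head_word):
        synonyms = set()
        for line in group:
            synonyms.update(w for w in line.split(",")[1:] if w)
        out.write(",".join([head] + sorted(synonyms)) + "\n")


# Stand-ins for the two sorted input files and the output file.
file_a = io.StringIO("blunt,dull\nsharp,abrupt,acute\n")
file_b = io.StringIO("keen,eager\nsharp,acute,crisp\n")
result = io.StringIO()
merge_sorted(file_a, file_b, result)
```

The sketch only works if both inputs use the same sort order for entry words; unsorted input would silently produce duplicate entries.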

Jeff_Tullin · October 28, 2015, 2:38pm

I’m with Greg.
Assuming no keyword is duplicated, here is one possible option.

Create a database table holding two columns

keyword, synonym

process the two files in the same way:

  for each file
    open the file
    repeat
      read a line
      for each synonym
        insert into the table a new row of keyword & one of the synonyms
        // you could search the table to see if they exist, but insert regardless is probably faster at this point
      next
    until end of file
  next file

At the end of that process you have every word & synonym from both files in a single table.

get one record set of all the keywords

select distinct keyword from mytable

loop through that list and for each get a set of all the synonyms

select distinct synonym from mytable where keyword = 'theword' order by synonym

concatenate the synonyms, and write your result to disk, then move to new keyword
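
The approach above can be sketched end to end as follows. This is a minimal Python sketch using the standard `sqlite3` module instead of Xojo's database classes; the table and column names follow the post, while `load_pairs` and `combined_lines` are hypothetical helper names and the short lists stand in for the two files.

```python
import sqlite3


def load_pairs(conn, lines):
    """Insert one (keyword, synonym) row per synonym, duplicates included."""
    rows = []
    for line in lines:
        words = [w for w in line.strip().split(",") if w]
        rows.extend((words[0], syn) for syn in words[1:])
    conn.executemany("INSERT INTO mytable (keyword, synonym) VALUES (?, ?)", rows)


def combined_lines(conn):
    """Yield one 'keyword,syn1,syn2,...' line per distinct keyword."""
    keywords = conn.execute(
        "SELECT DISTINCT keyword FROM mytable ORDER BY keyword"
    ).fetchall()
    for (keyword,) in keywords:
        synonyms = conn.execute(
            "SELECT DISTINCT synonym FROM mytable WHERE keyword = ? ORDER BY synonym",
            (keyword,),
        )
        yield ",".join([keyword] + [row[0] for row in synonyms])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (keyword TEXT, synonym TEXT)")
load_pairs(conn, ["sharp,abrupt,acute"])               # stand-in for file 1
load_pairs(conn, ["sharp,acute,crisp", "blunt,dull"])  # stand-in for file 2
result = list(combined_lines(conn))
```

The keyword is passed as a bound parameter (`?`) instead of being pasted into the SQL text, which avoids trouble with entries that contain an apostrophe.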

John_A_Knight_Jr · October 29, 2015, 2:25am

Use a SQLite file as suggested.

CREATE TABLE Synonyms ( word TEXT, syn TEXT, CONSTRAINT syn_key PRIMARY KEY (word,syn) ON CONFLICT IGNORE );
INSERT the word pairs, one pass for each input file.
As long as the case (upper/lower) matches, duplicate pairs will be quietly IGNOREd (no errors).

Then:

SELECT word, syn FROM Synonyms ORDER BY word, syn;
will give you all words & synonyms sorted.
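
As a rough illustration of how the constraint does the de-duplication, here is a minimal Python sketch using the standard `sqlite3` module (not Xojo). The schema and final query are the ones given above; `word_pairs` is a hypothetical helper, and the short lists stand in for the two input files.

```python
import sqlite3
from itertools import groupby


def word_pairs(lines):
    """Yield a (word, syn) tuple for every synonym on every line."""
    for line in lines:
        fields = [f for f in line.strip().split(",") if f]
        for syn in fields[1:]:
            yield fields[0], syn


conn = sqlite3.connect(":memory:")  # a file path here gives an on-disk database
conn.execute(
    "CREATE TABLE Synonyms (word TEXT, syn TEXT, "
    "CONSTRAINT syn_key PRIMARY KEY (word, syn) ON CONFLICT IGNORE)"
)

file_one = ["sharp,abrupt,acute"]               # stand-in for the first file
file_two = ["sharp,acute,crisp", "blunt,dull"]  # stand-in for the second file
for lines in (file_one, file_two):
    with conn:  # one transaction per input file
        conn.executemany(
            "INSERT INTO Synonyms (word, syn) VALUES (?, ?)", word_pairs(lines)
        )

# Rows arrive sorted by word, so consecutive rows can be folded into lines.
rows = conn.execute("SELECT word, syn FROM Synonyms ORDER BY word, syn")
combined = [
    ",".join([word] + [syn for _, syn in group])
    for word, group in groupby(rows, key=lambda row: row[0])
]
```

Wrapping each file's inserts in a single transaction matters for speed: committing after every row is typically far slower in SQLite.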

Jeff_Tullin · October 29, 2015, 6:19am

nice.

Beatrix_Willius · October 29, 2015, 6:25am

I consider 4GB as barely useable for any Xojo work. Have you monitored your RAM to see if memory is free or not?

+1 for the database approach.

James_Sentman · October 29, 2015, 1:26pm

That’s a lot of records… I can point out a couple of things and make some suggestions though.

In the second file read you’re not clearing the TempText variable that you use to count fields, so you’re ending up with that larger file in memory twice as you load it the second time line by line. Perhaps this is what’s putting your memory use over the edge?

Did you say you had 4 gig or 48 gig? Because 48 gig would certainly be plenty… I have 16 in this machine and have never run into a problem like that.

If it’s a memory problem, you’re then ending up with even more copies of the data as you concatenate a gigantic string at the end to write in one step. If you need to write it in one step, I would remove the dictionary entries as you add to the string so the memory use isn’t so much. You can also walk and write each line one at a time to the output file rather than build that big string. That will use much less memory.

There are a couple of suggestions I’d make for speeding it up.

The single biggest thing you’re doing speed-wise there, after reading the data, is force-refreshing the interface. If you comment out those lines you’ll find that it runs MUCH faster, but that you don’t get any feedback on the process.

You can force the updates only every 10 or 50 records read, and that’s usually plenty for a user to watch the progress. So do a count of the records read and written and then do something like:

if (ReadCount mod 25) = 0 then
textLabelWhatever.text = new data
textLabelWhatever.refresh
end if

and you’ll be very happy with the increase in speed.

The initial reads where you load the entire file and do a CountFields are going to be slow and memory intensive too. You could switch from reading the whole file and doing CountFields to using just the FolderItem.Length and keeping track of the TextInputStream.PositionB as you read. You can use that to drive a progress bar (as long as you don’t update it with every read!), and if you want a percent complete you can divide the two numbers to get how far into the file you are. You wouldn’t be able to display an actual count of individual words, but that would probably be OK to sacrifice to get the speedup.
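
The byte-position idea can be sketched as follows. This is a minimal Python sketch, not Xojo: `os.path.getsize` and `tell()` play the roles of FolderItem.Length and PositionB, `read_with_progress` and its `report` callback are hypothetical names, and the temporary file only exists to make the example self-contained.

```python
import os
import tempfile


def read_with_progress(path, report, every=1000):
    """Read a file line by line, reporting percent complete from the byte position."""
    total = os.path.getsize(path)
    lines = []
    count = 0
    with open(path, "rb") as f:
        for raw in iter(f.readline, b""):
            lines.append(raw.decode("ascii").rstrip("\r\n"))
            count += 1
            if count % every == 0:  # throttle: do not report on every line
                report(100 * f.tell() // total)
    report(100)
    return lines


# Demo with a small temporary file standing in for a thesaurus file.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"sharp,abrupt,acute\n" * 5)
seen = []
lines = read_with_progress(tmp.name, seen.append, every=2)
os.remove(tmp.name)
```

No counting pass over the file is needed, so each file is read once, not twice.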

That will help far more than the pragmas.

You can eliminate a lot of the string parsing. In some places you’re using Split and Join but in others you’re NthField-ing things, which is slow in loops. You can actually store an array of strings in the dictionary! So there isn’t any reason to convert to a string, then convert to an array, then back to a string with each line.

something like:

  dim workArray() as string
  dim workCount as integer

  Do
    workCount = workCount + 1
    workArray = Split( TempIn.ReadLine, ",")
    D.Value( workArray(0)) = workArray

    if (workCount mod 25) = 0 then
      Label1.Text = Str(InitialCountOrSizeFromSomewhere - workCount)
      Label1.Refresh
    end if
  Loop Until TempIn.EOF

And then it gets much simpler to add the second file. It could be simplified down to something like:

  Do

    workArray = Split( TextIn.ReadLine, ",")

    if D.HasKey( workArray(0)) then // join the 2 arrays together leaving out duplicates
      dim joinArray() as string = D.Value( workArray(0))
      dim arrayCount as integer = Ubound( workArray)

      for i as integer = 1 to arrayCount // skipping the first entry, which is the key word

        if joinArray.IndexOf( workArray( i)) = -1 then // only add synonyms that are not already in there
          joinArray.Append( workArray( i))
        end if

        // since you're adding data to an array that is already in the dictionary you don't even have to re-add it to the dictionary!

      next
    else // not already in there, just create
      D.Value( workArray(0)) = workArray
    end if

  Loop Until TextIn.EOF

Now you have a dictionary with all the arrays of words in it, and the key word is already in position 0 so you don’t have to do any special handling of that (though that means the root word is in memory twice for each entry, if memory is what’s causing the problem).

For writing it back out just rejoin them as you write, no need to create a big string:

  dim writeCount as integer = D.Count - 1
  for i as integer = 0 to writeCount
    dim words() as string = D.Value( D.Key( i)) // pull the stored array back out of the Variant
    TextOut.WriteLine( Join( words, ","))
  next

Now, if you REALLY wanted to speed it up, you could read the entire data file in large chunks and then parse them out in memory rather than using the TextInputStream. But you can only use a BinaryStream in that way, so you’d have to either write your own ReadLine method that looks for the next end of line, or use InStrB. The B variants are all you need if you’re just reading ASCII, which you are, since I see the encoding, and they are considerably faster since they don’t have to worry about text encoding and can just look at bytes. But that’s for further optimization, if you sort out the memory crash issue and it’s not fast enough after doing those things.
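
A hand-rolled "read line from chunks" routine of the kind described could look like the following. This is a minimal Python sketch, not Xojo: `iter_lines_chunked` is a hypothetical name, `io.BytesIO` stands in for a BinaryStream, and the sketch assumes lines end in LF or CRLF (files that use CR alone would need a different split).

```python
import io


def iter_lines_chunked(stream, chunk_size=1 << 20):
    """Yield lines as bytes (line ending removed) from a binary stream read in big chunks."""
    leftover = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (leftover + chunk).split(b"\n")
        leftover = parts.pop()  # the last piece may be an incomplete line
        for part in parts:
            yield part.rstrip(b"\r")
    if leftover:  # final line without a trailing line ending
        yield leftover.rstrip(b"\r")


# A tiny chunk size forces lines to straddle chunk boundaries.
data = io.BytesIO(b"sharp,acute\r\nblunt,dull\nkeen,eager")
lines = list(iter_lines_chunked(data, chunk_size=5))
```

The key detail is carrying the unfinished tail of each chunk over to the next one; everything else is a plain byte search.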

I really wouldn’t mess with an SQL database for this. You really should be able to do it like that if the memory crashing issue can be overcome.

Tim_Hare · October 29, 2015, 6:39pm

4 or 48 makes no difference. Your app can probably only use 3GB.

Charley_Collins1 · November 3, 2015, 5:57pm

Here’s an update that seems to work:

I have two synonym text files that I am attempting to combine. One file has 140,000 lines and the other has 30,000 lines. Each line of text consists of a main subject word followed by one or more synonyms, with all words separated by commas. Some lines may only have two words while other lines may have dozens of words.

File 1:
main subject,synonym,synonym,synonym,synonym
main subject,synonym,synonym,synonym,synonym
main subject,synonym,synonym,synonym,synonym
…

File 2:
main subject,synonym,synonym,synonym,synonym
main subject,synonym,synonym,synonym,synonym
main subject,synonym,synonym,synonym,synonym
…

Main subject words may be listed in both files; however, the lists of synonyms for each shared main subject are not the same, so one has to combine any lines with the same main subject word without duplicating the synonyms. Main subject words are not duplicated within each file.

Although I used dictionary-based code similar to what I previously posted to parse both of these files, I was not able to use a dictionary to combine them. I have 48GB of RAM installed and had plenty of free RAM for this task; however, the Xojo-built app would gradually increase its RAM use and crash soon after reaching 3.24GB in the Activity Monitor.

I suspect there could be a memory leak caused by the previous dictionary code, as parsing each of the above files separately within a dictionary was not a problem. The code below keeps the app in the range of 43MB to 55MB of RAM while it’s working and takes about 37 minutes, which could probably be improved; however, this is just a utility app I made to parse the above files for use with another app, so it will not see any other users.

  Dim F, F2, F3, F4 as FolderItem
  Dim TextIn, TempIn as TextInputStream
  Dim TextOut as TextOutputStream
  Dim CatList, CatL, CatList2(-1), CatTemp(-1), Catter(-1), CatWord, ParseText as String
  Dim Parse(-1), Thea(-1) as String
  Dim Begin, D2, D3, DoIt, DoIt2 as Integer
  Dim P1, T1, Remaining, Remaining2 as Integer
  
  #pragma BackgroundTasks False
  #pragma BoundsChecking False
  #pragma BreakOnExceptions False
  #pragma NilObjectChecking False
  #pragma StackOverflowChecking False

  F3 = GetFolderItem("").Child("combineddic")
  TextOut = TextOutputStream.Create(F3)
  
  F = GetFolderItem("").Child("MyThes-1.0").Child("parsedic")
  
  If F <> Nil then
    
    TempIn = F.OpenAsTextFile
    TempIn.Encoding = Encodings.ASCII
    
    If TempIn <> Nil then
      
      Do
        
        Parse.Append(TempIn.ReadLine)
        
      Loop Until TempIn.EOF
      
    End If
    
    TempIn.Close
    
    Label1.Text = "List 1 to Array Done"
    Label1.Refresh
    
  End If
  
  F2 = GetFolderItem("").Child("mthesaur.txt")
  
  If F2 <> Nil then
    
    TextIn = F2.OpenAsTextFile
    TextIn.Encoding = Encodings.ASCII
    
    If TextIn <> Nil then
      
      Do
        
        Thea.Append(TextIn.ReadLine)
        
      Loop Until TextIn.EOF
      
      TextIn.Close
      
      Label1.Text = "List 2 to Array Done"
      Label1.Refresh
      
    End If
    
    Do
      
      CatWord = Mid(Thea(0), 1, Instr(Thea(0), ",") - 1)
      
      CatL = Mid(Thea(0), Instr(Thea(0), ",") + 1) + ","
      
      Remaining2 = Parse.UBound
      
      For DoIt = Remaining2 DownTo 0
        
        If CatWord = Mid(Parse(DoIt), 1, Instr(Parse(DoIt), ",") - 1) then
          
          CatList2 = Split(Mid(Parse(DoIt), Instr(Parse(DoIt), ",") + 1), ",")
          
          Parse.Remove DoIt
          
          For DoIt2 = 0 to CatList2.Ubound
            
            If Instr(CatL, CatList2(DoIt2) + ",") = 0 then
              CatL = CatL + CatList2(DoIt2) + ","
              
            End If
            
          Next
          
          Exit
        End If
        
      Next
      
      CatL = Left(CatL, Len(CatL) - 1)
      Catter = Split(CatL, ",")
      Catter.Sort
      
      TextOut.WriteLine CatWord + "," + Join(Catter, ",")
      
      If Thea.UBound Mod 100 = 0 then
        Label1.Text = Str(Thea.UBound)
        Label1.Refresh
      End If
      
      Thea.Remove(0)
      
    Loop Until Thea.UBound = -1
    
  End If
  
  Parse.Sort
  
  For D2 = 0 to Parse.Ubound
    If F3 <> Nil then
      TextOut.WriteLine Parse(D2)
    End If
  Next
  
  TextOut.Close
  
  Label1.Text = "1st Save Complete"
  Label1.Refresh
  
  
  F3 = GetFolderItem("").Child("combineddic")
  
  If F3 <> Nil then
    
    TempIn = F3.OpenAsTextFile
    TempIn.Encoding = Encodings.ASCII
    
    If TempIn <> Nil then
      
      Do
        
        CatTemp.Append(TempIn.ReadLine)
        
      Loop Until TempIn.EOF
      
    End If
    
    CatTemp.Sort
    TempIn.Close
    
    Label1.Text = "Sort Complete"
    Label1.Refresh
    
  End If
  
  F4 = GetFolderItem("").Child("combinedthes")
  
  If F4 <> Nil then
    
    TextOut = TextOutputStream.Create(F4)
    
    For D2 = 0 to CatTemp.Ubound
      
      TextOut.WriteLine CatTemp(D2)
      
    Next
    
    TextOut.Close
    Label1.Text = "Final Save Complete"
    Label1.Refresh
    
  End If
  
  #pragma BackgroundTasks True
  #pragma BoundsChecking True
  #pragma BreakOnExceptions True
  #pragma NilObjectChecking True
  #pragma StackOverflowChecking True

Charley_Collins1 · November 8, 2015, 9:40am

Here’s the final version.

  Dim F, F2, F3 as FolderItem
  Dim TextIn, TempIn as TextInputStream
  Dim TextOut as TextOutputStream
  Dim CatList, CatL, CatList2(-1), Catter(-1), CatWord as String
  Dim Parse(-1), Thea(-1) as String
  Dim D2, D3, DoIt, DoIt2 as Integer
  Dim Remaining, Remaining2 as Integer
  
  #pragma BackgroundTasks False
  #pragma BoundsChecking False
  #pragma BreakOnExceptions False
  #pragma NilObjectChecking False
  #pragma StackOverflowChecking False

  F3 = GetFolderItem("").Child("combinedthes")
  TextOut = TextOutputStream.Create(F3)
  
  F = GetFolderItem("").Child("MyThes-1.0").Child("parsedic")
  
  If F <> Nil then
    
    TempIn = F.OpenAsTextFile
    TempIn.Encoding = Encodings.ASCII
    
    If TempIn <> Nil then
      
      Do
        Parse.Append(TempIn.ReadLine)
      Loop Until TempIn.EOF
      
    End If
    
    TempIn.Close
    
    Label1.Text = "List 1 to Array Done"
    Label1.Refresh
    
  End If
  
  F2 = GetFolderItem("").Child("mthesaur.txt")
  
  If F2 <> Nil then
    
    TextIn = F2.OpenAsTextFile
    TextIn.Encoding = Encodings.ASCII
    
    If TextIn <> Nil then
      
      Do
        Thea.Append(TextIn.ReadLine)
      Loop Until TextIn.EOF
      
      TextIn.Close
      
      Label1.Text = "List 2 to Array Done"
      Label1.Refresh
      
    End If
    
    Do
      
      CatWord = Mid(Thea(0), 1, Instr(Thea(0), ",") - 1)
      
      CatL = "," + Mid(Thea(0), Instr(Thea(0), ",") + 1) + ","
      
      Remaining2 = Parse.UBound
      
      For DoIt = Remaining2 DownTo 0
        
        If CatWord = Mid(Parse(DoIt), 1, Instr(Parse(DoIt), ",") - 1) then
          
          CatList2 = Split(Mid(Parse(DoIt), Instr(Parse(DoIt), ",") + 1), ",")
          
          Parse.Remove DoIt
          
          For DoIt2 = 0 to CatList2.Ubound
            
            If Instr(CatL, "," + CatList2(DoIt2) + ",") = 0 then
              CatL = CatL + CatList2(DoIt2) + ","
            End If
            
          Next
          
          Exit
        End If
        
      Next
      
      CatL = Left(CatL, Len(CatL) - 1)
      CatL = Right(CatL, Len(CatL) - 1)
      Catter = Split(CatL, ",")
      Catter.Sort
      
      Parse.Append CatWord + "," + Join(Catter, ",")
      
      If Thea.UBound Mod 100 = 0 then
        Label1.Text = Str(Thea.UBound)
        Label1.Refresh
      End If
      
      Thea.Remove(0)
      
    Loop Until Thea.UBound = -1
    
  End If
  
  Parse.Sort
  
  For D2 = 0 to Parse.Ubound
    If F3 <> Nil then
      If D2 <> Parse.Ubound then
        TextOut.WriteLine Parse(D2)
      Else
        TextOut.Write Parse(D2)
      End If
    End If
  Next
  
  TextOut.Close
  
  Label1.Text = "Save Complete"
  Label1.Refresh
  
  #pragma BackgroundTasks True
  #pragma BoundsChecking True
  #pragma BreakOnExceptions True
  #pragma NilObjectChecking True
  #pragma StackOverflowChecking True