Improving Thread Code to Parse Large Text File?

I am parsing a large thesaurus file containing 147,159 lines of text. This is loaded into an app variable upon start.

Each line consists of a key search word followed by a series of synonyms, which are separated by commas. There are over 3.1 million words or phrases separated by commas.

Individual words can appear multiple times in the file as a main entry word and/or synonym.

cat,alley cat, feline, …

feline,alley cat, cat, …

tiger,cat,feline, …

I wish to know if I can make any improvements to the code below to make it faster. The code is placed inside a thread. A pushbutton enables a timer and starts the thread. The timer updates a Label which displays the total count of comma-separated fields in the file and the current progress of the thread. When the Label is first updated, the count initially makes several large jumps of many thousands (6,500 to 7,500 per second or less) and then gradually slows down to about 900 to 1,000 per second or less.

There is 48GB of RAM installed in this Mac.

I only wish to extract a list of unique words from 3 to 36 letters long from the file.

I do not want to include any phrases (synonyms with spaces).

I also do not want any words with hyphens, apostrophes, periods or numbers.

I do not want words which are capitalized.

  Dim DoIt, DoIt2, LenWord, BanCount, InNewArr as Integer
  Dim DicWordList, Word, BanList, BanLetter, DicArr(), NewArr() as String
  Dim XFile as FolderItem
  Dim TextOut as TextOutputStream
  
  XFile = GetFolderItem("").Child("XFile.txt")
  
  If XFile <> Nil then
    XFile.Delete
  End If
  
  DicWordList = App.ThesaurusList
  DicWordList = ReplaceAll(DicWordList, EndofLine, " ")//147,159 Thesaurus Entries Separated by EndofLine
  
  DicArr = Split(DicWordList, ",")
  DicWordCount = UBound(DicArr) + 1//3,125,756  Words or Phrases Separated By Commas
  
  //Do Not Want:
      //Phrases Containing a Space
      //Words with Hyphen (H-bomb), Apostrophe(can't), Periods (a.k.a.) or Numbers
  BanList = " -'.0123456789"
  BanCount = Len(BanList)
  
  For DoIt = 0 to UBound(DicArr)
    
    DicCurrentCount = DoIt + 1
    
    Word = DicArr(DoIt)
    LenWord = Len(Word)
    
    For DoIt2 = 1 to BanCount
      BanLetter = Mid(BanList, DoIt2, 1)
      If Instr(Word, BanLetter) > 0 then
        Continue For DoIt
      End If
    Next
    
    //Only Want Words From 3 to 36 Letters
    //Do not Want Capitalized Words
    If LenWord < 3 or LenWord > 36 then//"or", not "and": a length cannot be below 3 and above 36 at once
      Continue
    Elseif StrComp(Left(Word, 1), Uppercase(Left(Word, 1)), 0) = 0 then
      Continue
    End If
    
    //Only Add Word If Not Already Present
    InNewArr = NewArr.IndexOf(Word)
    
    If InNewArr < 0 then
      NewArr.Append(Word)
    End If
    
  Next
  
  NewArr.Sort
  
  If XFile<> Nil then
    TextOut = TextOutputStream.Append(XFile)
    TextOut.Write Join(NewArr, ",")
    TextOut.Close
  End If

You should be able to have regex do the validation work for you. The pattern in this snippet matches only lowercase letters, so your numbers, apostrophes, hyphens, periods, and spaces will be excluded. It’ll also exclude everything else that isn’t a lowercase letter. The {3,36} handles your length requirement too.

I’m sure there are other improvements that can be made, such as using MBS regex instead for better performance. Xojo’s regex is kind of slow, but I’d expect still faster than looping over every character yourself.

Var Pattern As New Regex
Pattern.SearchPattern = "^[a-z]{3,36}$"
Pattern.Options.CaseSensitive = True

For DoIt = 0 To UBound(DicArr)
  DicCurrentCount = DoIt + 1
  
  Word = DicArr(DoIt)
  
  If (Pattern.Search(Word) Is Nil) = False Then
    NewArr.Append(Word)
  End If
Next

I’d use this pattern:

(?-i)(?<=,|^)[a-z]{3,36}(?=,|$)

And use a Dictionary instead of an array to initially store the words. That will take care of duplicate handling for you.
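
The gradual slowdown in the original loop is most likely caused by NewArr.IndexOf, which scans the whole array for every candidate, so each check gets slower as the list grows. A Dictionary finds a key by hash instead. Here is a minimal sketch of that idea, written against the classic API; the function name UniqueWords and its parameter are hypothetical, not part of the original code.

```xojo
// Hypothetical helper: returns the unique entries of Candidates, sorted.
Function UniqueWords(Candidates() As String) As String()
  Dim Seen As New Dictionary
  Dim Result() As String
  
  For Each Candidate As String In Candidates
    // HasKey is a hash lookup, so its cost does not grow with the
    // number of stored words the way Array.IndexOf does.
    If Not Seen.HasKey(Candidate) Then
      Seen.Value(Candidate) = True
      Result.Append(Candidate)
    End If
  Next
  
  Result.Sort
  Return Result
End Function
```

Only the keys matter here; the stored value is a placeholder.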

Where MBS’s RegEx implementation would be better is that the code could feed it the entire text rather than feeding it one line at a time. But when feeding it one line at a time, there shouldn’t be much difference.

Feeding it one at a time allows for progress though. The dictionary is a good point, I missed the need for duplicate filtering.

Here’s my shot at the code (untested):

DicWordList = App.ThesaurusList
DicWordList = DicWordList.ReplaceLineEndings( &uA )
var lines() as string = DicWordList.Split( &uA )

var rx as new RegEx
rx.SearchPattern = "(?-i)(?<=,|^)[a-z]{3,36}(?=,|$)"

var wordDict as new Dictionary

for each line as string in lines
  var match as RegExMatch = rx.Search( line )
  while match isa RegExMatch
    wordDict.Value( match.SubExpressionString( 0 ) ) = nil
    match = rx.Search // continue from the previous match; without the assignment the loop never ends
  wend
next line

var DicArr() as string
for each word as variant in wordDict.Keys
  DicArr.AddRow word.StringValue
next

DicArr.Sort

Wouldn’t it be better to put it into a SQLite database once?

I’d like to assume that would be the destination for the array.
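
If the word list does end up in SQLite, the database can enforce uniqueness itself. The following is only a rough sketch using the classic SQLiteDatabase API; the function name SaveWords, the table name words, and the column name word are all made up for illustration.

```xojo
// Hypothetical helper: stores Words in a one-column table; returns False on error.
Function SaveWords(DBFile As FolderItem, Words() As String) As Boolean
  Dim DB As New SQLiteDatabase
  DB.DatabaseFile = DBFile
  
  // Open the file if it exists, otherwise create it.
  If DBFile.Exists Then
    If Not DB.Connect Then Return False
  Else
    If Not DB.CreateDatabaseFile Then Return False
  End If
  
  // PRIMARY KEY lets SQLite reject duplicate words on its own.
  DB.SQLExecute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")
  If DB.Error Then Return False
  
  // One transaction for all rows; committing row by row would be much slower.
  DB.SQLExecute("BEGIN TRANSACTION")
  
  Dim Statement As SQLitePreparedStatement
  Statement = SQLitePreparedStatement(DB.Prepare("INSERT OR IGNORE INTO words (word) VALUES (?)"))
  Statement.BindType(0, SQLitePreparedStatement.SQLITE_TEXT)
  
  For Each Word As String In Words
    Statement.SQLExecute(Word)
    If DB.Error Then
      DB.Rollback
      Return False
    End If
  Next
  
  DB.Commit
  Return Not DB.Error
End Function
```

With INSERT OR IGNORE, a word that is already in the table is skipped silently, so the routine can be rerun without creating duplicates.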

I could not get a couple of lines of code to run in Kem’s example as they appear to be for newer versions of Xojo. Thom’s example runs but was missing duplicate checking.

I combined the two responses.

The following code completes the task in 14 seconds and uses less code, which is a vast improvement over my example.

  Dim DoIt as Integer
  Dim Word, DicWordList, DicArr(), NewArr() as String
  Dim XFile as FolderItem
  Dim TextOut as TextOutputStream
  Dim Pattern As New Regex
  Dim WordDict as new Dictionary
  
  Pattern.SearchPattern = "(?-i)(?<=,|^)[a-z]{3,36}(?=,|$)"
  Pattern.Options.CaseSensitive = True
  
  DicWordList = App.ThesaurusList
  DicWordList = ReplaceAll(DicWordList, EndofLine, " ")//147,159 Thesaurus Entries Separated by EndofLine
  
  DicArr = Split(DicWordList, ",")
  DicWordCount = UBound(DicArr) + 1//3,125,756  Words or Phrases Separated By Commas
  
  For DoIt = 0 To UBound(DicArr)
    DicCurrentCount = DoIt + 1
    
    Word = DicArr(DoIt)
    
    If (Pattern.Search(Word) Is Nil) = False Then
      WordDict.Value(Word) = Nil
    End If
  Next
  
  For Each Key As Variant In WordDict.Keys
    
    NewArr.Append(Key) 
    
  Next
  
  NewArr.Sort
  
  XFile = GetFolderItem("").Child("XFile.txt")
  
  If XFile <> Nil then//Delete Old Copy
    XFile.Delete
  End If
  
  If XFile<> Nil then
    TextOut = TextOutputStream.Append(XFile)
    TextOut.Write Join(NewArr, ",")
    TextOut.Close
  End If

Your results will be incorrect because you’ve turned, for example:

this,that
in,out

into:

this,that in,out

Your array will thus be:

this
that in
out

and both “that” and “in” will be ignored since “that in” does not match the pattern.

Suggestions:

  • Use ReplaceLineEndings instead of ReplaceAll, and replace the EOL with “,” instead of a space.
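  • Change the pattern to (?-i)^[a-z]{3,36}$ because you are splitting by the commas anyway, so the lookarounds are no longer needed. Keep the ^ and $ anchors so that the whole element has to match; without them an element such as “alley cat” would still match on “alley”.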
  • You can use If Pattern.Search(word) isa RegExMatch Then to see if there is a match. That’s a little easier to read.
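
Put together, the suggestions above might look roughly like the sketch below, written against the classic API since the newer syntax did not run for the original poster. The function name ExtractWords is hypothetical, and the Trim call is an assumption: it treats a space after a comma (as in "cat,alley cat, feline") as formatting rather than as part of a phrase. Remove it if a leading space should disqualify the entry. The progress counters are left out for brevity.

```xojo
// Hypothetical helper: returns the sorted, unique lowercase words of 3 to 36 letters.
Function ExtractWords(Source As String) As String()
  Dim Pattern As New RegEx
  // Anchored: the whole element must be 3 to 36 lowercase letters.
  Pattern.SearchPattern = "(?-i)^[a-z]{3,36}$"
  Pattern.Options.CaseSensitive = True
  
  // Turn every line ending into a comma so that the last word of one
  // line and the first word of the next never run together.
  Dim Normalized As String = ReplaceLineEndings(Source, ",")
  Dim Elements() As String = Split(Normalized, ",")
  
  Dim WordDict As New Dictionary
  For Each Element As String In Elements
    // Assumption: whitespace around an element is formatting, so trim it.
    Dim Word As String = Trim(Element)
    If Pattern.Search(Word) IsA RegExMatch Then
      WordDict.Value(Word) = Nil // the key alone handles duplicates
    End If
  Next
  
  Dim Result() As String
  For Each Key As Variant In WordDict.Keys
    Result.Append(Key.StringValue)
  Next
  Result.Sort
  Return Result
End Function
```

It could then be called as Dim Words() As String = ExtractWords(App.ThesaurusList), with the result joined and written to the file as before.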