Improving Thread Code to Parse Large Text File?

Charles_Greyson · July 28, 2020, 1:46pm

I am parsing a large thesaurus file containing 147,159 lines of text. This is loaded into an app variable upon start.

Each line consists of a key search word followed by a series of synonyms, which are separated by commas. There are over 3.1 million words or phrases separated by commas.

Individual words can appear multiple times in the file as a main entry word and/or as a synonym.

cat,alley cat, feline, …

feline,alley cat, cat, …

tiger,cat,feline, …

I would like to know whether the code below can be made faster. The code runs inside a thread. A pushbutton enables a timer and starts the thread. The timer updates a Label that displays the total count of comma-separated fields in the file and the thread's current progress. When the Label first updates, the count jumps by many thousands (6,500 to 7,500 per second or less), then gradually slows to about 900 to 1,000 per second or less.

There is 48GB of RAM installed in this Mac.

I only wish to extract a list of unique words of 3 to 36 letters from the file.

I do not want to include any phrases (synonyms with spaces).

I also do not want any words with hyphens, apostrophes, periods or numbers.

I do not want words which are capitalized.

  Dim DoIt, DoIt2, LenWord, BanCount, InNewArr as Integer
  Dim DicWordList, Word, BanList, BanLetter, DicArr(), NewArr() as String
  Dim XFile as FolderItem
  Dim TextOut as TextOutputStream
  
  XFile = GetFolderItem("").Child("XFile.txt")
  
  If XFile <> Nil then
    XFile.Delete
  End If
  
  DicWordList = App.ThesaurusList
  DicWordList = ReplaceAll(DicWordList, EndofLine, " ") // 147,159 Thesaurus Entries Separated by EndOfLine
  
  DicArr = Split(DicWordList, ",")
  DicWordCount = UBound(DicArr) + 1 // 3,125,756 Words or Phrases Separated by Commas
  
  //Do Not Want:
      //Phrases Containing a Space
      //Words with Hyphen (H-bomb), Apostrophe(can't), Periods (a.k.a.) or Numbers
  BanList = " -'.0123456789"
  BanCount = Len(BanList)
  
  For DoIt = 0 to UBound(DicArr)
    
    DicCurrentCount = DoIt + 1
    
    Word = DicArr(DoIt)
    LenWord = Len(Word)
    
    For DoIt2 = 1 to BanCount
      BanLetter = Mid(BanList, DoIt2, 1)
      If Instr(Word, BanLetter) > 0 then
        Continue For DoIt
      End If
    Next
    
    //Only Want Words From 3 to 36 Letters
    //Do not Want Capitalized Words
    If LenWord < 3 Or LenWord > 36 then
      Continue
    Elseif StrComp(Left(Word, 1), Uppercase(Left(Word, 1)), 0) = 0 then
      Continue
    End If
    
    //Only Add Word If Not Already Present
    InNewArr = NewArr.IndexOf(Word)
    
    If InNewArr < 0 then
      NewArr.Append(Word)
    End If
    
  Next
  
  NewArr.Sort
  
  If XFile<> Nil then
    TextOut = TextOutputStream.Append(XFile)
    TextOut.Write Join(NewArr, ",")
    TextOut.Close
  End If

Thom_McGrath · July 28, 2020, 2:21pm

You should be able to have regex do the validation work for you. The pattern in this snippet matches only lowercase letters, so your numbers, apostrophes, hyphens, periods, and spaces will be excluded. It’ll also exclude everything else that isn’t a lowercase letter. The {3,36} handles your length requirement too.

I’m sure there are other improvements that can be made, such as using MBS regex instead for better performance. Xojo’s regex is kind of slow, but I’d still expect it to be faster than looping over every character yourself.

Var Pattern As New Regex
Pattern.SearchPattern = "^[a-z]{3,36}$"
Pattern.Options.CaseSensitive = True

For DoIt = 0 To UBound(DicArr)
  DicCurrentCount = DoIt + 1
  
  Word = DicArr(DoIt)
  
  If (Pattern.Search(Word) Is Nil) = False Then
    NewArr.Append(Word)
  End If
Next

Kem_Tekinay · July 28, 2020, 2:25pm

I’d use this pattern:

(?-i)(?<=,|^)[a-z]{3,36}(?=,|$)

Kem_Tekinay · July 28, 2020, 2:27pm

And use a Dictionary instead of an array to initially store the words. That will take care of duplicate handling for you.
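As a rough illustration of why this helps: NewArr.IndexOf scans the array from the start on every call, so each lookup gets slower as the array grows, which would be consistent with the progress counter slowing down over time. A Dictionary key lookup does not have that growth. The sketch below uses the classic API; the names Seen, Candidates and Unique are made up for the example and are not from the original code.

```xojo
// Sketch only: Seen, Candidates and Unique are illustrative names.
Dim Seen As New Dictionary
Dim Candidates() As String = Array("cat", "feline", "cat", "tiger", "feline")
Dim Unique() As String

For Each Candidate As String In Candidates
  // Assigning to an existing key just overwrites it, so duplicates
  // collapse without any explicit IndexOf-style check.
  Seen.Value(Candidate) = Nil
Next

// Copy the keys back into an array so they can be sorted.
For Each Key As Variant In Seen.Keys
  Unique.Append(Key.StringValue)
Next

Unique.Sort
// Unique now holds: cat, feline, tiger
```

The stored value is never read, so Nil is enough; only the keys matter here.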

Kem_Tekinay · July 28, 2020, 2:29pm

Where MBS’s RegEx implementation would be better is that the code could feed it the entire text rather than feeding it one line at a time. But when feeding it one line at a time, there shouldn’t be much difference.

Thom_McGrath · July 28, 2020, 2:32pm

Feeding it one at a time allows for progress though. The dictionary is a good point, I missed the need for duplicate filtering.

Kem_Tekinay · July 28, 2020, 2:37pm

Here’s my shot at the code (untested):

DicWordList = App.ThesaurusList
DicWordList = DicWordList.ReplaceLineEndings( &uA )
var lines() as string = DicWordList.Split( &uA )

var rx as new RegEx
rx.SearchPattern = "(?-i)(?<=,|^)[a-z]{3,36}(?=,|$)"

var wordDict as new Dictionary

for each line as string in lines
  var match as RegExMatch = rx.Search( line )
  while match isa RegExMatch
    wordDict.Value( match.SubExpressionString( 0 ) ) = nil
    match = rx.Search // continue from the previous match; without the assignment the loop never ends
  wend
next line

var DicArr() as string
for each word as variant in wordDict.Keys
  DicArr.AddRow word.StringValue
next

DicArr.Sort

MarkusR · July 28, 2020, 9:38pm

Wouldn't it be better to put it into a SQLite database once?

Thom_McGrath · July 28, 2020, 9:44pm

I’d like to assume that would be the destination for the array.

Charles_Greyson · July 29, 2020, 7:00am

I could not get a couple of lines of code to run in Kem’s example as they appear to be for newer versions of Xojo. Thom’s example runs but was missing duplicate checking.

I combined the two responses.

The following code completes the task in 14 seconds and uses less code, which is a vast improvement over my example.

  Dim DoIt as Integer
  Dim Word, DicWordList, DicArr(), NewArr() as String
  Dim XFile as FolderItem
  Dim TextOut as TextOutputStream
  Dim Pattern As New Regex
  Dim WordDict as new Dictionary
  
  Pattern.SearchPattern = "(?-i)(?<=,|^)[a-z]{3,36}(?=,|$)"
  Pattern.Options.CaseSensitive = True
  
  DicWordList = App.ThesaurusList
  DicWordList = ReplaceAll(DicWordList, EndofLine, " ") // 147,159 Thesaurus Entries Separated by EndOfLine
  
  DicArr = Split(DicWordList, ",")
  DicWordCount = UBound(DicArr) + 1 // 3,125,756 Words or Phrases Separated by Commas
  
  For DoIt = 0 To UBound(DicArr)
    DicCurrentCount = DoIt + 1
    
    Word = DicArr(DoIt)
    
    If (Pattern.Search(Word) Is Nil) = False Then
      WordDict.Value(Word) = Nil
    End If
  Next
  
  For Each Key As Variant In WordDict.Keys
    
    NewArr.Append(Key) 
    
  Next
  
  NewArr.Sort
  
  XFile = GetFolderItem("").Child("XFile.txt")
  
  If XFile <> Nil then//Delete Old Copy
    XFile.Delete
  End If
  
  If XFile<> Nil then
    TextOut = TextOutputStream.Append(XFile)
    TextOut.Write Join(NewArr, ",")
    TextOut.Close
  End If

Kem_Tekinay · July 29, 2020, 4:40pm

Your results will be incorrect because you’ve turned, for example:

this,that
in,out

into:

this,that in,out

Your array will thus be:

this
that in
out

and both “that” and “in” will be ignored since “that in” does not match the pattern.

Suggestions:

Use ReplaceLineEndings instead of ReplaceAll, and replace the EOL with “,” instead of a space.
Change the pattern to (?-i)^[a-z]{3,36}$ because you are splitting by the commas anyway, so the lookarounds are unnecessary and each entry only has to match from start to end.
You can use If Pattern.Search(word) isa RegExMatch Then to see if there is a match. That’s a little easier to read.
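Put together, the suggestions above might look something like the following classic-API sketch. It is untested and makes a few assumptions: App.ThesaurusList holds the whole file as in the original post, the progress counters and file output are left out for brevity, and the Trim call assumes entries can carry a space after the comma (as in "cat, feline"). The pattern is anchored with ^ and $ because Search looks for a match anywhere in the string, so an unanchored [a-z]{3,36} would also accept an entry such as "alley cat".

```xojo
// Sketch only, not a drop-in replacement for the code above.
Dim DicWordList As String = ReplaceLineEndings(App.ThesaurusList, ",")
Dim DicArr() As String = Split(DicWordList, ",")

Dim Pattern As New RegEx
// Anchored so the entire entry must be 3 to 36 lowercase letters.
Pattern.SearchPattern = "(?-i)^[a-z]{3,36}$"
Pattern.Options.CaseSensitive = True

Dim WordDict As New Dictionary
For Each Entry As String In DicArr
  // Assumption: entries may have padding around them, so trim first.
  // Drop the Trim if the file has no such padding.
  Dim Word As String = Trim(Entry)
  If Pattern.Search(Word) IsA RegExMatch Then
    WordDict.Value(Word) = Nil // duplicates overwrite the same key
  End If
Next

// Copy the unique words out of the Dictionary and sort them.
Dim NewArr() As String
For Each Key As Variant In WordDict.Keys
  NewArr.Append(Key.StringValue)
Next
NewArr.Sort
```

With the line endings turned into commas, "this,that" followed by "in,out" splits into this, that, in and out, so no entries are fused across lines.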