I am parsing a large thesaurus file containing 147,159 lines of text. It is loaded into an app variable at startup.
Each line consists of a key search word followed by a series of synonyms, which are separated by commas. There are over 3.1 million words or phrases separated by commas.
Individual words can appear multiple times in the file, as a main entry word and/or as a synonym:
cat,alley cat, feline, …
feline,alley cat, cat, …
tiger,cat,feline, …
I would like to know whether I can change the code below to make it faster. The code runs inside a thread. A pushbutton enables a timer and starts the thread. The timer updates a Label that displays the total count of comma-separated fields in the file and the thread's current progress. When the Label first updates, the count makes several large jumps (6,500 to 7,500 per second or less), then gradually slows to about 900 to 1,000 per second or less.
There is 48GB of RAM installed in this Mac.
I only wish to extract a list of unique words of 3 to 36 letters from the file.
I do not want to include any phrases (synonyms with spaces).
I also do not want any words with hyphens, apostrophes, periods or numbers.
I do not want words which are capitalized.
Dim DoIt, DoIt2, LenWord, BanCount, InNewArr as Integer
Dim DicWordList, Word, BanList, BanLetter, DicArr(), NewArr() as String
Dim XFile as FolderItem
Dim TextOut as TextOutputStream
//Delete Any Output File Left Over From a Previous Run
XFile = GetFolderItem("").Child("XFile.txt")
If XFile <> Nil then
  XFile.Delete
End If
DicWordList = App.ThesaurusList
DicWordList = ReplaceAll(DicWordList, EndOfLine, " ") //147,159 Thesaurus Entries Separated by EndOfLine
DicArr = Split(DicWordList, ",")
DicWordCount = UBound(DicArr) + 1 //3,125,756 Words or Phrases Separated by Commas
//Do Not Want:
//Phrases Containing a Space
//Words with Hyphen (H-bomb), Apostrophe (can't), Periods (a.k.a.) or Numbers
BanList = " -'.0123456789"
BanCount = Len(BanList)
For DoIt = 0 to UBound(DicArr)
  DicCurrentCount = DoIt + 1
  Word = DicArr(DoIt)
  LenWord = Len(Word)
  //Skip Any Word Containing a Banned Character
  For DoIt2 = 1 to BanCount
    BanLetter = Mid(BanList, DoIt2, 1)
    If Instr(Word, BanLetter) > 0 then
      Continue For DoIt
    End If
  Next
  //Only Want Words From 3 to 36 Letters
  //Do Not Want Capitalized Words
  If LenWord < 3 Or LenWord > 36 then
    Continue
  Elseif StrComp(Left(Word, 1), Uppercase(Left(Word, 1)), 0) = 0 then
    Continue
  End If
  //Only Add Word If Not Already Present
  InNewArr = NewArr.IndexOf(Word)
  If InNewArr < 0 then
    NewArr.Append(Word)
  End If
Next
NewArr.Sort
If XFile <> Nil then
  TextOut = TextOutputStream.Append(XFile)
  TextOut.Write Join(NewArr, ",")
  TextOut.Close
End If
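The gradual slowdown would be consistent with NewArr.IndexOf scanning an ever longer array for every candidate word, since IndexOf is a linear search. Below is a minimal sketch of one alternative I am considering, which I have not timed against the full file: keep the words already added in a Dictionary and test membership with HasKey. Seen and Candidates are hypothetical names introduced only for this illustration; Candidates stands in for the words that have already passed the filters above.

```xojo
// Sketch only: Seen and Candidates are hypothetical names, not part of the code above.
Dim Candidates() As String = Array("cat", "feline", "cat", "tiger", "feline")
Dim NewArr() As String
Dim Seen As New Dictionary

For Each Word As String In Candidates
  // HasKey is a hash lookup, so it does not walk the whole list the way IndexOf does
  If Not Seen.HasKey(Word) Then
    Seen.Value(Word) = True
    NewArr.Append(Word)
  End If
Next

NewArr.Sort
// NewArr now holds: cat, feline, tiger
```

Note that Dictionary compares String keys case-insensitively. I assume that is acceptable here because capitalized words are already filtered out.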