Can you do a RegEx search on a Dictionary type if the data is Simplified Chinese?
err = new RegExException
redim whereToPut(-1) // reset results array
Found = false
rg = New RegEx // setup for search via RegEx
// make sure there is something in whatToFind
if whatToFind = "" then
  Raise err
end if
// check for proper search syntax
rg.SearchPattern = "[^a-zA-Z0-9{}.,*]" // limit search characters
if rg.Search(whatToFind) <> nil then
  raise err
end if
// Set search for input length of word
rg.SearchPattern = "^"+whatToFind+"\b"
// setup search index
if midB(whatToFind,1,1) = "." then
  j = 0
  k = UBound(wordList)
else
  temp = asc(Uppercase(mid(whatToFind,1,1))) - 65 // index a=0
  if temp > 0 then
    j = wordIndex(temp) //start of a letter
    k = wordIndex(temp + 1) - 1 // end of a letter
  else
    exit
  end if
end if
// Search the wordlist using RegEx
For i = j to k
  if rg.search(wordList(i)) <> nil then
    Found = true
    whereToPut.append wordList(i)
  end if
next
return Found
I don't have an issue here, and I tried in both RegExRX and code.
var rx as new RegEx
rx.SearchPattern = "ë"
var match as RegExMatch = rx.Search( Dutch_Regex_Test )
if match is nil then
  MessageBox "Nope"
else
  MessageBox match.SubExpressionString( 0 ) // I get here
end if
When I search my Chinese text file I can't get the search to find anything. @AlbertoD suggested the following search pattern for extended Dutch characters, [^\x{00eb}a-zA-Z0-9{}.,], which seems to work for Dutch but not if I enter any Chinese character.
Thanks, yes I am after the alphabet range for Dutch, Chinese...
Is it possible to cover the characters of most European languages and Chinese or does that become too broad?
After a lot of reading and failing I came up with the following search pattern. My thinking was that it covers the Simplified Chinese character range, any Unicode letter, 0-9, and the characters .,* which I use as wildcards.
I tried your app RegExRX to get a better handle on the search patterns but still can't get the right pattern for the Chinese set. [The app is pretty useful for other reasons though.]
Do you get an error?
What does the error show?
Maybe a RegExSearchPatternException?
Do you get any info if you click the Exception, maybe there is something in Message?
Why are you using \u if you saw the pattern working with \x ?
Shouldn't \p{L} include "any language", including Chinese?
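For reference, here is a minimal sketch of the two approaches raised in these questions, assuming Xojo's PCRE-based RegEx is matching Unicode text. PCRE-style patterns generally spell a code point as \x{hhhh} with braces rather than \uhhhh, and \p{L} stands for any Unicode letter. The range U+4E00 to U+9FFF is the main CJK Unified Ideographs block and is only one possible choice; the sample strings are made up for illustration.

```xojo
// Sketch only: the sample strings are hypothetical.
Var samples() As String = Array("hello", "中文", "wor.*", "bad!")

// Option 1: explicit code point range, written with \x{...}.
Var rxRange As New RegEx
rxRange.SearchPattern = "[^\x{4e00}-\x{9fff}a-zA-Z0-9{}.,*]"

// Option 2: any Unicode letter via \p{L}.
Var rxLetters As New RegEx
rxLetters.SearchPattern = "[^\p{L}0-9{}.,*]"

For Each s As String In samples
  // A Nil match means no disallowed character was found in s.
  If rxRange.Search(s) Is Nil Then
    System.DebugLog(s + " passes the range check")
  End If
  If rxLetters.Search(s) Is Nil Then
    System.DebugLog(s + " passes the \p{L} check")
  End If
Next
```

If the \p{L} version accepts a Chinese string here but the real search still finds nothing, the validation pattern is probably not the culprit and the text encoding of the word list would be the next thing to check.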
Dim rg As RegEx
Dim i,j,k As Integer
Dim Found As Boolean
Dim err As RegExException
Dim temp as Integer
err = new RegExException
redim whereToPut(-1) // reset results array
Found = false
rg = New RegEx // setup for search via RegEx
// make sure there is something in whatToFind
if whatToFind = "" then
  Raise err
end if
// check for proper search syntax
'rg.SearchPattern = "[^a-zA-Z0-9{}.,*]" // Original A-Z search
rg.SearchPattern = "[^\pL0-9{}.,*]" // Extended search characters
if rg.Search(whatToFind) <> nil then
  raise err
end if
rg.SearchPattern = "^"+whatToFind+"\b"
// setup search index
if whatToFind.Middle(1,1) = "." then
  j = 0
  k = UBound(wordList)
else
  temp = asc(Uppercase(whatToFind.Middle(1,1)))' - 65 // index a=0
  if temp > 0 then
    j = wordIndex(temp) //start of a letter
    k = wordIndex(temp + 1) - 1 // end of a letter
  else
    exit
  end if
end if
// Search the wordlist using RegEx
For i = j to k
  if rg.search(wordList(i)) <> nil then
    Found = true
    whereToPut.append wordList(i)
  end if
next
return Found
Which was working fine. I can probably adapt the searching to find any 30 words in the dictionary, but that kind of defeats the purpose, or I could support fewer languages. If I use the original search pattern it only works with English. If I try the extended Unicode search I lose all the search restrictions and only get random words of any length, and no Chinese at all.
Hi Alberto,
Yes, it wasn't initially clear to me either. The code is about 20 years old and was initially part of a dictionary module by John Thomas (thanks John). I adapted it with his permission and probably didn't clean out what I didn't need over time; now that I am trying to incorporate other languages, it needs looking at.
Long story, but the age of the code explains what John was doing to speed up searching. The dictionary needs to be sorted for this to work: wordIndex holds the positions where each letter of the alphabet starts and finishes, so the RegEx search can jump straight to that location.
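A rough sketch of how such an index could be built, assuming a sorted word list whose entries start with A-Z. BuildWordIndex is a hypothetical helper, not the original module's code; its layout (slot 0 = A, slot 26 = one past the last word) is chosen to line up with the wordIndex(temp) and wordIndex(temp + 1) - 1 lookups in the search code.

```xojo
// Hypothetical helper, not part of the original dictionary module.
Function BuildWordIndex(wordList() As String) As Integer()
  Var wordIndex(26) As Integer // slots 0..25 = A..Z, slot 26 = end marker
  Var nextLetter As Integer = 0

  For i As Integer = 0 To wordList.LastIndex
    Var letter As Integer = Asc(wordList(i).Left(1).Uppercase) - 65
    If letter < 0 Or letter > 25 Then Continue // skip initials outside A-Z

    // Record i as the start of every letter up to and including this one
    // that has not been assigned yet (letters with no words get an empty range).
    While nextLetter <= letter
      wordIndex(nextLetter) = i
      nextLetter = nextLetter + 1
    Wend
  Next

  // Remaining letters, and the end marker, point one past the last word.
  While nextLetter <= 26
    wordIndex(nextLetter) = wordList.LastIndex + 1
    nextLetter = nextLetter + 1
  Wend

  Return wordIndex
End Function
```

An index like this only helps while every word begins with one of 26 known letters, which is likely why it stops being useful once accented or Chinese initials are in the list.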
err = new RegExException
redim whereToPut(-1) // reset results array
Found = false
rg = New RegEx // setup for search via RegEx
// make sure there is something in whatToFind
if whatToFind = "" then
  whatToFind = ".*"
  'Raise err
end if
rg.SearchPattern = "[^\pL0-9{}.,*]" // Extended search characters
// Verify for proper search syntax
if rg.Search(whatToFind) <> nil then
  raise err
end if
//As per the screenshot I want to be able to search the dictionary
//for 4 letter words eg .... or words ending in LY eg .*LY
//Is this the correct Search Pattern, what does \b do?
var ss as string = "^"+whatToFind+"\b"
rg.SearchPattern = ss
// Search dictionary and store matches in whereToPut
For i = 0 to wordList.LastIndex
  if rg.search(wordList(i)) <> nil then
    Found = true
    whereToPut.append wordList(i)
  end if
next
return Found
The results are interesting now that I am including other languages. When I search using ....
English returns only four-letter words.
German, Turkish, Portuguese, Swedish, Polish... return mostly four-letter words, but every time a word contains a special letter like "ó", "á", etc., it adds words with more letters??
In the code above, if whatToFind = "...." then the SearchPattern = "^....\b", which works so long as there are no special characters.
This seems to be correct: start at the beginning of the line, match any four letters, then a word boundary.
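On the \b question: \b matches at a word boundary, that is, between a "word" character and a non-word character. In PCRE-style engines the default definition of a word character is limited to ASCII letters, digits, and underscore, so a boundary can be reported in the middle of a word, between an accented letter and a plain ASCII letter. If that is what is happening here (an assumption, not verified against this project), "^....\b" can match the first four characters of a longer word such as "cafés". A sketch that compares it with anchoring on the end of the string via $, using made-up sample words:

```xojo
// Sketch: compare \b with $ as the end anchor. Sample words are hypothetical.
Var words() As String = Array("word", "café", "cafés", "words")

Var rxBoundary As New RegEx
rxBoundary.SearchPattern = "^....\b" // pattern from the code above

Var rxEnd As New RegEx
rxEnd.SearchPattern = "^....$" // the word must end after four characters

For Each w As String In words
  If rxBoundary.Search(w) <> Nil Then
    System.DebugLog("boundary pattern matched: " + w)
  End If
  If rxEnd.Search(w) <> Nil Then
    System.DebugLog("end-anchored pattern matched: " + w)
  End If
Next
```

With $ in place of \b, the two wildcard examples would become "^....$" and "^.*LY$", which do not depend on how the engine classifies non-ASCII letters.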