Regex Search and Simplified Chinese

Martin_Fitzgibbons · February 15, 2024, 8:18am

Can you do a RegEx search on a Dictionary type if the data is Simplified Chinese?


err = new RegExException

redim whereToPut(-1)                      // reset results array
Found = false                                  
rg = New RegEx                                        // setup for search via RegEx

// make sure there is something in whatToFind
if whatToFind = "" then
  Raise err
end if
// check for proper search syntax
rg.SearchPattern = "[^a-zA-Z0-9{}.,*]" // limit search characters
if rg.Search(whatToFind) <> nil then
  raise err
end if

// Set search for input length of word

rg.SearchPattern = "^"+whatToFind+"\b" 

// setup search index
if midB(whatToFind,1,1) = "." then
  j = 0
  k = UBound(wordList)
else
  temp = asc(Uppercase(mid(whatToFind,1,1))) - 65  // index a=0 
  if temp > 0 then
    j = wordIndex(temp)                       //start of a letter
    k = wordIndex(temp + 1) - 1          // end of a letter
  else
    exit
  end if
end if

// Search the wordlist using RegEx
For i = j to k         
  if rg.search(wordList(i)) <> nil then
    Found = true
    whereToPut.append wordList(i)
  end if
next

return Found

Kem_Tekinay · February 15, 2024, 2:42pm

As long as both the source and the pattern are encoded as UTF-8, it shouldn’t be a problem. But I have not tried it.
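
A minimal sketch of what "both encoded as UTF-8" could look like in code. EnsureUTF8 is a hypothetical helper name, and the sketch assumes that text arriving without an encoding really is UTF-8 on disk:

```xojo
// Hypothetical helper: returns s tagged and stored as UTF-8.
Function EnsureUTF8(s As String) As String
  If s.Encoding Is Nil Then
    // No encoding attached (common for raw file or socket data).
    // Assumption: the bytes are already UTF-8, so only label them.
    Return s.DefineEncoding(Encodings.UTF8)
  Else
    // A known encoding is attached: convert the bytes to UTF-8.
    Return s.ConvertEncoding(Encodings.UTF8)
  End If
End Function
```

Passing both the word list entries and whatToFind through a helper like this before calling rg.Search would rule out an encoding mismatch as the cause.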

Martin_Fitzgibbons · February 21, 2024, 11:53am

I have the file loaded as UTF-8 and can see it in the debugger.

But when I do the search it doesn’t find anything. Do I have to modify the search pattern restrictions?
rg.SearchPattern = "[^a-zA-Z0-9{}.,*]"

Kem_Tekinay · February 21, 2024, 3:36pm

Can you post the actual file here?

Martin_Fitzgibbons · April 4, 2024, 11:49am

Here is a Dutch file, but it has the same problem.

Try searching for ‘ë’

Kem_Tekinay · April 4, 2024, 12:47pm

I don’t have an issue here and I tried in both RegExRX and code.

var rx as new RegEx
rx.SearchPattern = "ë"
var match as RegExMatch = rx.Search( Dutch_Regex_Test )
if match is nil then
  MessageBox "Nope"
else
  MessageBox match.SubExpressionString( 0 )  // I get here
end if

Martin_Fitzgibbons · April 4, 2024, 9:41pm

What would you use as a general search pattern for languages like Dutch and Chinese?

Kem_Tekinay · April 4, 2024, 9:51pm

That’s too broad a question. I’d have to know what pattern you’re looking to match, and the source you’re matching against.

Martin_Fitzgibbons · April 4, 2024, 10:21pm

If this is the current search pattern, what needs to be added to include the extended character sets?

Martin_Fitzgibbons · April 4, 2024, 11:24pm

When I search my Chinese text file I can’t get the search to find anything. @AlbertoD suggested the following search pattern for extended Dutch characters, [^\x{00eb}a-zA-Z0-9{}.,], which seems to work for Dutch but not if I enter any Chinese, e.g. 万.

AlbertoD · April 4, 2024, 11:32pm

It is only for ë as you asked on the other thread.

This is the character you just posted as an example:

so you need to change (or add) 4E07 for this character.

You may want to add a range, if so take a look at this block and check what you need to add to your pattern:

Hope this helps and is clearer.

Martin_Fitzgibbons · April 4, 2024, 11:41pm

Thanks, yes I am after the alphabet range for Dutch, Chinese…
Is it possible to cover the characters of most European languages and Chinese or does that become too broad?

Kem_Tekinay · April 4, 2024, 11:52pm

The Unicode tokens should be able to help you here.

rx.SearchPattern = "\pL" // any unicode letter
rx.SearchPattern = "\PL" // anything that is NOT a Unicode letter

You can also specify particular languages through scripts like so:

rx.SearchPattern = "\p{Latin}"
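
As a rough sketch of how the letter token could replace the a-zA-Z range in the validation pattern from the first post: the character class below rejects anything that is not a Unicode letter, a digit, or one of the wildcard characters. The sample inputs are made up for illustration.

```xojo
Var rx As New RegEx
// Matches the first character that is NOT a letter, digit or wildcard
rx.SearchPattern = "[^\p{L}0-9{}.,*]"

// Hypothetical sample inputs
Var inputs() As String = Array("wor.", "geë.*", "万", "bad input!")

For Each s As String In inputs
  If rx.Search(s) Is Nil Then
    System.DebugLog(s + " : only allowed characters")
  Else
    System.DebugLog(s + " : contains a disallowed character")
  End If
Next
```

Since \p{L} covers letters in every script, a separate range for Chinese should not be needed for this check. \p{Han} could be used instead if only Chinese characters should pass.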

Martin_Fitzgibbons · April 5, 2024, 1:28am

After a lot of reading and failing I came up with the following search pattern. My thinking was that this covers the Simplified Chinese character range + any Unicode letter + 0-9, plus .,* which I use as wildcards.

rg.SearchPattern = "[^\u4E00-\u9FFF\p{L}0-9{}.,*]+"

What did I miss?

Martin_Fitzgibbons · April 5, 2024, 3:04am

I tried your app RegExRX to try to get a better handle on the search patterns, but I still can’t get the right pattern for the Chinese set. [The app is pretty useful for other reasons though.]

AlbertoD · April 5, 2024, 3:23am

Do you get an error?
What does the error show?
Maybe a RegExSearchPatternException?
Do you get any info if you click the Exception, maybe there is something in Message?
Why are you using \u if you saw the pattern working with \x ?
Shouldn’t \p{L} include ‘any language’ including Chinese?
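
To make the \x versus \u point concrete, here is a sketch of the same pattern written with \x{...} code point escapes, which is the form that worked for ë earlier in the thread. As far as I know the PCRE engine behind RegEx does not accept \uXXXX, but treat that as an assumption and let the Try block report what actually happens.

```xojo
Var rx As New RegEx
// \x{4E00}-\x{9FFF} is the CJK Unified Ideographs block (万 is U+4E07).
// The range is probably redundant because \p{L} already includes it.
rx.SearchPattern = "[^\x{4E00}-\x{9FFF}\p{L}0-9{}.,*]"

Try
  If rx.Search("万") Is Nil Then
    System.DebugLog("万 passes the character check")
  Else
    System.DebugLog("万 is rejected by the character check")
  End If
Catch e As RegExSearchPatternException
  // Raised when the engine cannot compile the pattern
  System.DebugLog("Pattern error: " + e.Message)
End Try
```

Swapping the \x{...} escapes back to \u in this snippet is a quick way to see whether the pattern itself is what fails.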

Martin_Fitzgibbons · April 5, 2024, 7:44am

This seems to be getting quite a bit more complicated. The original Method allowed for some searching of the Dictionary in English

Dim rg As RegEx
Dim i,j,k As Integer
Dim Found As Boolean
Dim err As RegExException
Dim temp as Integer

err = new RegExException

redim whereToPut(-1)                      // reset results array
Found = false                                  
rg = New RegEx                               // setup for search via RegEx

// make sure there is something in whatToFind
if whatToFind = "" then
  Raise err
end if
// check for proper search syntax
'rg.SearchPattern = "[^a-zA-Z0-9{}.,*]" // Original A-Z search
rg.SearchPattern = "[^\pL0-9{}.,*]" // Extended search characters

if rg.Search(whatToFind) <> nil then
  raise err
end if

rg.SearchPattern = "^"+whatToFind+"\b" 

// setup search index
if whatToFind.Middle(1,1) = "." then
  j = 0
  k = UBound(wordList)
else
  temp = asc(Uppercase(whatToFind.Middle(1,1)))' - 65  // index a=0 
  if temp > 0 then
    j = wordIndex(temp)                       //start of a letter
    k = wordIndex(temp + 1) - 1          // end of a letter
  else
    exit
  end if
end if

// Search the wordlist using RegEx
For i = j to k         
  if rg.search(wordList(i)) <> nil then
    Found = true
    whereToPut.append wordList(i)
  end if
next

return Found

Which was working fine. I can probably adapt the searching to find any 30 words in the dictionary, but that kind of defeats the purpose, or I could support fewer languages. If I use the original search pattern it only works with English. If I try the extended Unicode search I lose all the search restrictions and can only get random words of any length, and no Chinese at all.

AlbertoD · April 5, 2024, 12:30pm

Sorry, it is not clear (to me) what you are trying to do, and I don’t know what wordIndex and wordList are.

Martin_Fitzgibbons · April 7, 2024, 4:50am

Hi Alberto,
Yes, it wasn’t initially clear to me either. The code is about 20 years old and was initially part of a dictionary module by John Thomas (Thanks John). I adapted it with his permission and probably didn’t clean out what I didn’t need over time, and now that I am trying to incorporate other languages it needs looking at.
Long story, but the age of the code explains what John was doing to speed up searching. The dictionary needs to be sorted for it to work, and wordIndex held the position where each letter of the alphabet starts and finishes, so the RegEx search can jump straight to that location.


err = new RegExException

redim whereToPut(-1)                      // reset results array
Found = false                                  
rg = New RegEx                               // setup for search via RegEx

// make sure there is something in whatToFind
if whatToFind = "" then
  whatToFind = ".*"
  'Raise err
end if

rg.SearchPattern = "[^\pL0-9{}.,*]" // Extended search characters

// Verify for proper search syntax
if rg.Search(whatToFind) <> nil then
  raise err
end if

//As per the screenshot I want to be able to search the dictionary
//for 4 letter words eg .... or words ending in LY eg .*LY

//Is this the correct Search Pattern, what does \b do?
var ss as string = "^"+whatToFind+"\b"
rg.SearchPattern =  ss

// Search the dictionary (wordList) and store matches in whereToPut
For i = 0 to wordList.LastIndex   
  if rg.search(wordList(i)) <> nil then
    Found = true
    whereToPut.append wordList(i)
  end if
next

return Found
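
For reference, the wordIndex idea described at the top of this post could be rebuilt roughly as follows. This is a hypothetical reconstruction, not John Thomas's original code. It assumes wordList is sorted and that its entries start with A-Z, which is exactly the assumption that stops working for Chinese and for words starting with accented letters.

```xojo
// Hypothetical helper: slot n holds the index of the first word that
// starts with letter n (A=0 ... Z=25); slot 26 marks the end of "Z".
Function BuildWordIndex(wordList() As String) As Integer()
  Var wordIndex(26) As Integer          // slots 0..26
  Var nextLetter As Integer = 0

  For i As Integer = 0 To wordList.LastIndex
    Var letter As Integer = wordList(i).Left(1).Uppercase.Asc - 65
    If letter < 0 Or letter > 25 Then Continue  // not A-Z (e.g. 万): skip

    // Record i as the start of every letter up to and including this one
    While nextLetter <= letter
      wordIndex(nextLetter) = i
      nextLetter = nextLetter + 1
    Wend
  Next

  // Letters with no words, plus the end marker, point one past the last word
  While nextLetter <= 26
    wordIndex(nextLetter) = wordList.Count
    nextLetter = nextLetter + 1
  Wend

  Return wordIndex
End Function
```

With an index like this, the words for letter n run from wordIndex(n) to wordIndex(n + 1) - 1, which is the j and k range the older version of the method computed. Looping from 0 to wordList.LastIndex, as the code above now does, avoids the A-Z assumption at the cost of scanning every word.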

Martin_Fitzgibbons · April 7, 2024, 6:53am

The results are interesting now that I am including other languages. When I search using ....

English returns only four-letter words.
German, Turkish, Portuguese, Swedish, Polish… return mostly four-letter words, but every time a word contains a special letter like ‘ó’, ‘á’, etc., it also returns words with more letters.

In the code above, if whatToFind = "...." then the search pattern is "^....\b", which works as long as there are no special characters.

This seems to be correct: start at the beginning of the line, then match any four characters followed by a word boundary.
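
A possible explanation for the longer words, offered as an assumption since it has not been checked against the actual word lists: \b marks a position between a "word" character and a "non-word" character, and by default PCRE only counts ASCII letters, digits and the underscore as word characters. A letter like ó then counts as non-word, so a boundary appears right beside it in the middle of a longer word. The sketch below compares three variants on made-up sample words. The (*UCP) prefix is a PCRE option that asks for Unicode-aware \w and \b. Whether the PCRE build inside Xojo honours it is an assumption to verify.

```xojo
// Hypothetical sample words: 4 letters, 5 letters with é, 5 plain letters
Var words() As String = Array("word", "cafés", "words")

// 1) original pattern, 2) end-of-string anchor, 3) Unicode-aware boundary
Var patterns() As String = Array("^....\b", "^....$", "(*UCP)^....\b")

For Each p As String In patterns
  Var rx As New RegEx
  rx.SearchPattern = p
  For Each w As String In words
    If rx.Search(w) <> Nil Then
      System.DebugLog(p + " matches " + w)
    End If
  Next
Next
```

If the explanation holds, only the first pattern should report a match for "cafés". Because each wordList entry is a single word, ending the pattern with $ instead of \b is probably the simplest fix, and it does not depend on what the engine considers a word character.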