Regex Search and Simplified Chinese

The data and the search pattern work correctly in RegExRX, so there must be a problem with this last bit of code. All the data in Wordlist looks correct in the debugger, but when I step through it, it still allows words of more than 4 letters that contain a foreign character.

Suggestions?

Actually, after a closer look and a few more runs, RegExRX doesn’t show the correct data either and does match my code’s results; look below. Why do the code and RegExRX not search for whole words with \b?

@Kem_Tekinay just checking if you can see why the search isn’t picking up whole words.

A stab in the dark, but are you sure the line endings in your sample data are as expected and consistent throughout the data you’re testing?
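
For anyone wanting to rule this out quickly, here is a minimal sketch in Python (not Xojo) that counts each kind of line ending in raw bytes. The function name `count_line_endings` and the sample bytes are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone CR and lone LF line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF contains one CR and one LF, so subtract those pairs
    # to get the lone occurrences.
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf
    return {"CRLF": crlf, "CR": lone_cr, "LF": lone_lf}


# Hypothetical sample with deliberately mixed endings.
mixed = b"one\ntwo\r\nthree\rfour\n"
counts = count_line_endings(mixed)
print(counts)
```

If more than one of the three counts is non-zero, the file has mixed line endings.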

The \b word boundary token seems to fail around characters outside the single-byte set, so you can’t rely on it here.

Try something like this instead:

^\pL{4}(?=\PL|$)
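
To illustrate the idea outside Xojo, here is a rough sketch in Python. Python's built-in `re` module is not PCRE and has no \pL, so the class `[^\W\d_]` is used as a stand-in for "any letter" and `[\W\d_]` for \PL. That substitution and the sample word list are assumptions made for this example only.

```python
import re

# Assumption: stand-ins for \pL and \PL, since Python's `re` lacks both.
LETTER = r"[^\W\d_]"      # word character that is not a digit or underscore
NON_LETTER = r"[\W\d_]"   # everything else

# Rough equivalent of ^\pL{4}(?=\PL|$), applied line by line.
four_letters = re.compile(rf"^{LETTER}{{4}}(?={NON_LETTER}|$)", re.MULTILINE)

# Hypothetical word list, one entry per line.
sample = "你好世界\n你好世界和平\ntest\ntesting\n"
found = four_letters.findall(sample)
print(found)
```

The lookahead does the job \b was meant to do: a line only matches when the four letters are followed by a non-letter or the end of the line, so the six-character Chinese entry and "testing" are skipped.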

BBEdit says Unix (LF).

@Martin_Fitzgibbons,

I’ve been burnt before by the “not so obvious stuff” like mixed line endings in a file and since the search looked like it was checking “per line”, I thought it worth mentioning.

In this case it sounds like Kem has the real answer, hopefully.


That specific example works, but does that mean the rest of the general searches I was using are also restricted to single-byte characters in Regex?
.*
AB.{3,5}
.*ly


.{3,6}

First, avoid .* if you can. Be as specific in your patterns as you can.

The answer appears to be no, the dot token matches any character. The problem appears to be with the \b word boundary token.

You can also use \X, which matches any extended grapheme cluster, including newlines.
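
A small sketch of the point about the dot, using Python's `re` as a stand-in for the engine (an assumption, since it is not PCRE). The sample string is made up. It shows `.` counting characters rather than bytes, and not matching a newline by default.

```python
import re

word = "你好世界"                      # 4 characters
byte_length = len(word.encode("utf-8"))  # but 12 bytes in UTF-8

# .{3,6} counts characters, so a 4-character Chinese word fits.
dot_matches_cjk = re.fullmatch(r".{3,6}", word) is not None

# By default the dot does not match a newline.
dot_matches_newline = re.fullmatch(r".{3,6}", "ab\ncd") is not None

print(byte_length, dot_matches_cjk, dot_matches_newline)
```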

Thanks, looks like some rethinking is needed if I want to add extra languages.
Last question before the rethink:
How would I rewrite a search for words starting with AB with 3-5 letters?
^\pL\bAB\w{1,3}\b

\bAB\pL{1,3}(?=$|\PL)

\b = word boundary
\pL{1,3} = any Letter, one to three times
(?=$|\PL) = lookahead: next character is end of line or something other than a letter


How do you change that to any number of letters if you have to avoid .*?

Promise last question

{1,}?

* = zero or more, e.g., \pL*
+ = one or more, e.g., \pL+
{x} = exactly x, e.g., \pL{2}
{x,y} = min x, max y, e.g., \pL{2,4}
{x,} = min x to unlimited, e.g., \pL{3,}

Using + is the same as {1,}.

Using * is the same as {0,}.
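
A quick way to convince yourself of these equivalences is to run each quantifier against the same word list. This is a sketch in Python, where `[^\W\d_]` stands in for \pL (an assumption, since Python's `re` has no \pL). The helper `matching` and the word list are made up for illustration.

```python
import re

LETTER = r"[^\W\d_]"  # assumption: stand-in for \pL

# Hypothetical test words, including an empty string.
words = ["", "a", "bb", "ccc", "dddd", "你好世"]


def matching(quantifier: str) -> list:
    """Return the words made up entirely of LETTER repeated per `quantifier`."""
    pattern = re.compile(f"{LETTER}{quantifier}")
    return [w for w in words if pattern.fullmatch(w)]


print(matching("*"))      # zero or more letters, so the empty string matches too
print(matching("+"))      # one or more
print(matching("{2}"))    # exactly two
print(matching("{2,4}"))  # two to four
print(matching("{3,}"))   # three or more
```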

Thanks Kem that should get me to the end :slight_smile:


For the general community, while pondering Martin’s challenge and Kem’s suggested solutions, I stumbled across this:

It’s a really nice resource.

I’d almost say it was perfect, but it doesn’t link to Kem’s great RegExRX software. (which I own and highly recommend) :grinning:


Kem,

Is the \b word boundary failure for double byte characters fixable? Is it a Xojo plugin problem or something else?

It’s in the PCRE engine itself. I just tried it through the pcre2grep command line utility and it fails to match \b after a “ü”.
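
The same kind of failure can be reproduced in Python by forcing an ASCII-only notion of word characters with `re.ASCII`. This is only an analogy: it assumes the failure comes from \b treating non-ASCII letters as non-word characters, and it uses Python's `re`, not PCRE or pcre2grep.

```python
import re

text = "AB\u00fc"  # "ABü", with a precomposed ü

# ASCII-only word characters: ü is not a word character, and neither is the
# end of the text, so there is no boundary after it and \b fails.
ascii_result = re.search(r"AB.\b", text, re.ASCII)

# Unicode-aware word characters (Python's default for str patterns):
# ü counts as a word character, so \b matches at the end.
unicode_result = re.search(r"AB.\b", text)

print(ascii_result, unicode_result)
```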
