Regex question

Jonathan_Ashwell · January 17, 2014, 2:54pm

I’m having a problem with RegEx not matching a Unicode character and I wonder what I’m doing wrong. I have a list of words that I want to exclude from any capitalization when output, and I’m using RegEx to determine if a string contains those words, using \b to indicate word boundaries. It works with unaccented characters, but fails with accented characters. Here’s the code with a simple example:

dim theString as string = "This is test"
dim rg as new RegEx
rg.Options.CaseSensitive = false

rg.SearchPattern = "\b" + "" + "\b" // whole-word search
dim rm as RegExMatch = rg.Search(theString)
while rm <> nil // -- rm is nil here, so the loop never runs
   // do something
   rm = rg.Search // continue after the previous match
wend

Everything is UTF-8, and the byte representations for the accented character are the same in all strings.

(In the LR, \b is shown as matching a backspace in one section, and as a word boundary in another. I’m assuming the first is a typo).

Any help is appreciated.

Kem_Tekinay · January 17, 2014, 3:42pm

I think you found a bug in the PCRE code. Your code is correct (although you can just use "\b\b" instead of "\b" + "" + "\b"), but it simply doesn’t recognize the word boundaries around the single high-ascii character.

Try this instead:

rg.SearchPattern = "(?<=\s|^)(?=\s|$)"

This uses lookarounds to emulate “\b”.

Kem_Tekinay · January 17, 2014, 3:46pm

Sorry, this is a closer emulation, and I might still be forgetting something:

rg.SearchPattern = "(?<=[\s[:punct:]]|^)(?=[[:punct:]\s]|$)"
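
For anyone following along, here is a minimal sketch of how the lookaround emulation of \b might be plugged into a search loop. The word "é" and the sample sentence are stand-ins chosen for illustration, not values from the original post, and the \Q...\E wrapper is an added assumption so that the word is treated literally.

```xojo
Dim word As String = "é" // hypothetical stand-in for an accented stop word

Dim rg As New RegEx
rg.Options.CaseSensitive = False
// \Q...\E asks PCRE to treat the word literally (assumes it never contains "\E").
rg.SearchPattern = "(?<=[\s[:punct:]]|^)\Q" + word + "\E(?=[[:punct:]\s]|$)"

Dim rm As RegExMatch = rg.Search("This is é test") // hypothetical sample text
While rm <> Nil
  System.DebugLog(rm.SubExpressionString(0)) // subexpression 0 is the whole match
  rm = rg.Search // no arguments: continue after the previous match
Wend
```

The lookbehind and lookahead consume nothing, so the match itself is just the word, as it would be with \b.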

Jonathan_Ashwell · January 17, 2014, 4:02pm

Thanks a lot, Kem! Your suggestion works great.

Jonathan_Ashwell · January 18, 2014, 12:52pm

There is a refinement I’d like to make to this search, but the search is complex enough that I haven’t been able to hit on it. I’d like to treat some words that contain punctuation as two words. For example, I’d like to treat “l’état” as two words, “l” and “état”. There are also Arabic words that start with a few characters, a hyphen, and then the rest of the word, like “al-”. Since I’m creating a stop list of words whose capitalization should be handled specially, in these cases I need to treat the prefix and the suffix differently (so, for example, I could convert “l’état” to “l’État”). What modification to the search is required to have it see ’ and - as word breaks? I thought [:punct:] might do that, but of course it does not.

Kem_Tekinay · January 18, 2014, 3:20pm

I just realized that Xojo now supports the Unicode tokens of PCRE, so that changes how we can approach this. Try this pattern:

rg.SearchPattern = "(?<=^|\PL)\pL+(?=\PL|$)"

This uses lookarounds to start the match after a non-letter or beginning of line, and end it before a non-letter or end of line. The actual match is one or more Unicode letters.

You can get a list of these Unicode properties and try different combinations in my app, RegExRX.
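
As a rough illustration of the Unicode-letter pattern, the sketch below collects every run of letters from a string. The sample text is made up for this example, and it assumes the input is UTF-8 so that \pL recognizes accented letters.

```xojo
Dim rg As New RegEx
rg.SearchPattern = "(?<=^|\PL)\pL+(?=\PL|$)"

Dim words() As String
Dim rm As RegExMatch = rg.Search("l'état de al-Andalus") // hypothetical sample text
While rm <> Nil
  words.Append(rm.SubExpressionString(0)) // subexpression 0 is the whole match
  rm = rg.Search // no arguments: continue after the previous match
Wend
```

Because the apostrophe and the hyphen are not letters, this should split the sample into l, état, de, al, and Andalus, which is the "treat punctuation as a word break" behavior asked about above.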

Jonathan_Ashwell · January 19, 2014, 12:24am

Do you mean a query like this?

(?<=^|\PL)\pL+(?=\PL|$)(?<=^|\PL)\pL+(?=\PL|$)

That doesn’t appear to work. I’ll download your app and play with it, thanks.

Kem_Tekinay · January 19, 2014, 6:28am

Based on what you’ve already mentioned, that doesn’t seem like the pattern you’d want. That pattern starts after a non-letter and matches one or more Unicode letters and stops when it reaches a non-letter. Then it looks for the literal token “”, but then confirms that the thing just matched is a non-letter, which is a contradiction. In other words, that will never match anything, ever.

I think you want to match a string of Unicode letters surrounded by non-letters, and that’s what the pattern I gave you will do. If I’m incorrect, please let me know specifically what you’re trying to match and I’ll try to help.

Again, the pattern I recommend is this:

rg.SearchPattern = "(?<=^|\PL)\pL+(?=\PL|$)"

Jonathan_Ashwell · January 19, 2014, 3:42pm

Yes, I see what you’re doing now, it’s very cool. There are situations, though, when it is desirable to be able to search for multiple words at once (since the RegEx above returns individual words, that’s not possible). I’m using a list of “stop words” whose case should never change. For example, “United States” should always be output with both words capitalized. In such cases, it would be useful to be able to search the string (sentence) for an exact match where the words “united states” appear.

Kem_Tekinay · January 19, 2014, 4:37pm

You could do something like this:

rg.SearchPattern = "(?<=^|\PL)united states|other phrase|\pL+"

You would fill in your stop words where it says “united states|other phrase”. You would only need to fill in the phrases with spaces or punctuation, and you’d separate each with a bar (the “|” means “or”). The final choice is “\pL+” which matches a single word.

(By the way, I realized the trailing lookahead was not necessary, so I removed it.)
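
One way this might look with the phrase list kept in an array is sketched below. The array contents and the sample sentence are hypothetical, and the sketch assumes the phrases contain no regex metacharacters.

```xojo
// Hypothetical stop phrases; assumed to contain no regex metacharacters.
Dim phrases() As String = Array("united states", "other phrase")

Dim rg As New RegEx
rg.Options.CaseSensitive = False
// Phrases are listed first so they are tried before the single-word fallback \pL+.
rg.SearchPattern = "(?<=^|\PL)" + Join(phrases, "|") + "|\pL+"

Dim found() As String
Dim rm As RegExMatch = rg.Search("The United States of America") // hypothetical sample
While rm <> Nil
  found.Append(rm.SubExpressionString(0)) // a whole phrase or a single word
  rm = rg.Search // no arguments: continue after the previous match
Wend
```

With case sensitivity off, the sample should come back as The, United States, of, and America, so the phrase arrives as one match that can be checked against the stop list.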

Jonathan_Ashwell · January 19, 2014, 7:47pm

Thank you once again. BTW, RegExRX is a very cool utility, I went ahead and bought it.

Kem_Tekinay · January 19, 2014, 7:58pm

Oh, thanks!

Axel_Schneider · February 6, 2014, 3:07am

I also have a question about regex.

I want to remove the text between two square brackets and the brackets also.

SearchPattern = “[.\[*].\]” - removes one letter or number and the brackets.

How can I remove everything when the number of letters between the brackets varies?

Kem_Tekinay · February 6, 2014, 4:00am

If I understand, you want to turn text like this:

this is [some text] in brackets

into

this is  in brackets

In that case, you have the pattern sort of reversed:

\[[^]]*\]

Axel_Schneider · February 6, 2014, 4:07am

Thank you, that’s exactly what I was looking for.

Kem_Tekinay · February 6, 2014, 4:11am

No problem. Just to explain it for anyone following the thread, that says, match an opening bracket \[, followed by zero or more characters that are not a closing bracket [^]]*, followed by the closing bracket \].
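
A minimal sketch of using that pattern to strip the bracketed text, assuming an empty replacement is what is wanted and that every bracketed run should go, not just the first:

```xojo
Dim rg As New RegEx
rg.SearchPattern = "\[[^]]*\]"
rg.ReplacementPattern = "" // replace each match with nothing
rg.Options.ReplaceAllMatches = True // handle every bracketed run in the string

Dim cleaned As String = rg.Replace("this is [some text] in brackets")
// cleaned should now be "this is  in brackets" (the two surrounding spaces remain)
```

If the leftover double space matters, a second pass that collapses runs of spaces would be needed; the pattern itself only removes the brackets and what is inside them.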