RegEx Search Pattern with Unicode Text

Denise_Adams · March 9, 2017, 10:10am

I’m trying to remove the control codes in the range U+0080 to U+009F from UTF-8 text using the pattern below, but it also strips out some accented Unicode characters, such as É, by mistake:

re.SearchPattern = "[\x7F-\x9F]+"

I checked the language reference, and it says RegEx can’t handle double-byte text and that I have to convert it to another encoding first. That seems a strange solution: converting it to Windows-1252 and then back to UTF-8?

Is there a better way to achieve this?

Kem_Tekinay · March 9, 2017, 5:24pm

re.SearchPattern = "\p{Cc}+"

Test to make sure though.

Denise_Adams · March 9, 2017, 7:06pm

Thanks, Kem, but I think this would strip out all control codes. I need to specify ranges so I can optionally keep Chr(13), Chr(9), etc., and remove them manually if required.

Kem_Tekinay · March 9, 2017, 7:15pm

What about a negative lookbehind to exclude “good” characters? For example:

\p{Cc}(?<![\x01-\x20])

If you are just going to Replace All, you don’t need the repeater.
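
For illustration only, the lookbehind idea can be sketched in Python rather than Xojo. Python's built-in re module does not support \p{Cc}, so this sketch spells the control characters out as an explicit class. The set of characters to keep (tab and carriage return) and the name strip_controls are assumptions made for the example, not part of the suggestion above.

```python
import re

# Control characters: C0 range, DEL, and C1 range, written out explicitly
# because Python's `re` has no \p{Cc}. The negative lookbehind then rejects
# any match whose character is one we want to keep (assumed here: tab, CR).
CONTROL = re.compile(r"[\x00-\x1f\x7f-\x9f](?<![\t\r])")

def strip_controls(text: str) -> str:
    """Remove control characters from text, except tab and carriage return."""
    return CONTROL.sub("", text)

# NUL and U+0085 are removed; tab, CR, and the accented letter survive.
print(repr(strip_controls("A\x00B\tC\x85D\rÉ")))  # 'AB\tCD\rÉ'
```

Because the substitution replaces every match, no + repeater is needed on the class.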

Denise_Adams · March 9, 2017, 8:42pm

What do you mean by “repeater”?

Kem_Tekinay · March 9, 2017, 8:47pm

The +, meaning “one or more”.

Denise_Adams · March 9, 2017, 9:48pm

Gotcha. Thanks I’ll try this out.

Denise_Adams · March 10, 2017, 11:49am

[quote=319681:@Kem Tekinay]What about a negative lookbehind to exclude “good” characters? For example:

\p{Cc}(?<![\x01-\x20])

If you are just going to Replace All, you don’t need the repeater.[/quote]

This worked, but it turns out the problem was not the RegEx but my code. My original search pattern also works. I realized that text pasted from the clipboard was handled correctly, but text from a binary read of a file was not.

Even though both paths called the same utility method to convert the string to UTF-8 and strip the control characters, I still needed to specify Encodings.UTF8 for the binary read. Without it, the RegEx caused the problem I encountered.

Thanks for your help, Kem!
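
The difference between matching raw bytes and matching decoded text can be sketched in Python. This is an analogy, not a description of Xojo's RegEx internals: it assumes that a string with no defined encoding ends up being matched byte by byte. In UTF-8, É is stored as the two bytes C3 89, and the second byte falls inside the \x7F-\x9F range.

```python
import re

# Bytes as they might arrive from a binary file read: É, U+0085, "!"
raw = "É\x85!".encode("utf-8")  # b'\xc3\x89\xc2\x85!'

# Byte-level pattern: every byte is treated as its own character.
byte_pattern = re.compile(rb"[\x7f-\x9f]+")
# Text-level pattern: operates on decoded Unicode code points.
text_pattern = re.compile(r"[\x7f-\x9f]+")

# Matching bytes removes the trailing byte of É and of U+0085,
# leaving lead bytes that are no longer valid UTF-8.
damaged = byte_pattern.sub(b"", raw)
# Decoding first means only the real control character U+0085 matches.
cleaned = text_pattern.sub("", raw.decode("utf-8"))

print(damaged)  # b'\xc3\xc2!'
print(cleaned)  # É!
```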