I’m trying to remove the control codes in the range U+0080–U+009F from UTF-8 text using the pattern below, but it also mistakenly strips some accented Unicode characters such as É:
re.SearchPattern = "[\x7F-\x9F]+"
I checked the language reference, and it says RegEx can’t handle double-byte text, so I would have to convert it to another encoding first. That seems a strange solution: converting to Windows-1252 and then back to UTF-8?
Is there a better way to achieve this?
re.SearchPattern = "\p{Cc}+"
Test to make sure though.
Thanks, Kem, but I think this would strip out all control codes. I need to specify ranges so I can optionally keep Chr(13), Chr(9), etc. and remove them manually if required.
What about a negative lookbehind to exclude “good” characters? For example:
\p{Cc}(?<![\x01-\x20])
If you are just going to Replace All, you don’t need the repeater.
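For readers who want to try the lookbehind idea outside Xojo, here is a rough Python sketch. It is only an analogue, not the Xojo code: Python's built-in re module does not support \p{Cc}, so the Cc category is written out by hand as U+0000–U+001F plus U+007F–U+009F, and the names CONTROL_EXCEPT_LOW and strip_controls are made up for illustration.

```python
import re

# Assumption: Cc is spelled out manually because Python's re has no \p{Cc}.
# The class matches any control character; the negative lookbehind then
# rejects the match if the character just consumed is in U+0001-U+0020,
# so tab, CR, LF and the other low controls are left alone.
CONTROL_EXCEPT_LOW = re.compile(r"[\x00-\x1F\x7F-\x9F](?<![\x01-\x20])")

def strip_controls(text: str) -> str:
    """Remove control characters except those in U+0001-U+0020."""
    return CONTROL_EXCEPT_LOW.sub("", text)

sample = "A\tB\r\nC\x00D\x85E\x7fÉ"
cleaned = strip_controls(sample)
```

Because sub replaces every match, no + repeater is needed here either. Narrowing the lookbehind class (for example to just \x09 and \x0D) would keep fewer characters.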
What do you mean by “repeater”?
The +, meaning “one or more”.
Gotcha. Thanks I’ll try this out.
[quote=319681:@Kem Tekinay]What about a negative lookbehind to exclude “good” characters? For example:
\p{Cc}(?<![\x01-\x20])
If you are just going to Replace All, you don’t need the repeater.[/quote]
This worked, but it turns out the problem was not the RegEx but my code. My original search pattern also worked: I realized that text pasted from the clipboard was handled correctly, but text from a binary read of a file was not.
Even though both paths called the same utility method to convert the string to UTF-8 and strip the control characters, I still needed to specify Encodings.UTF8 for the binary read; otherwise the RegEx caused the problem I encountered.
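The effect described above can be reproduced in Python, offered here only as an illustrative analogue of the Xojo behavior. In UTF-8, É is stored as the bytes C3 89, and 0x89 falls inside the 7F–9F range, so a pattern applied to undecoded bytes removes half of the character. The function names below are hypothetical.

```python
import re

BYTE_PATTERN = re.compile(rb"[\x7F-\x9F]+")  # matches raw byte values
TEXT_PATTERN = re.compile(r"[\x7F-\x9F]+")   # matches code points U+007F-U+009F

def strip_without_decoding(data: bytes) -> bytes:
    """Apply the range to raw bytes, as happens when no encoding is set."""
    return BYTE_PATTERN.sub(b"", data)

def strip_after_decoding(data: bytes) -> str:
    """Decode as UTF-8 first, then apply the range to code points."""
    return TEXT_PATTERN.sub("", data.decode("utf-8"))

# "É", then the control code U+0085, then " ok", as read from a file.
raw = "É\x85 ok".encode("utf-8")

damaged = strip_without_decoding(raw)  # continuation bytes 0x89 and 0x85 are removed
cleaned = strip_after_decoding(raw)    # only U+0085 is removed
```

After this runs, damaged is no longer valid UTF-8 because the second byte of É is gone, while cleaned still contains É intact.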
Thanks for your help, Kem!