RegEx to remove non-alphanumerics

RegEx has me beat, I’ve read docs, read tutorials and can’t get a damned thing to work!

I’m trying to use it to clean up text and, importantly, remove non-alphanumerics from a string, but all it does is remove a single space.

[code]Dim reg as new RegEx
reg.searchPattern = "[^a-zA-Z0-9]"
reg.replacementPattern = ""

Dim source as string = "wiou fhbv6H98E61-5L!@&L$!@O ^?W""><{O)_R)&&L!_""D"
Dim result as string = reg.replace( source )
msgBox source + endOfLine + result[/code]

[code]Dim reg as new RegEx
reg.searchPattern = "[^a-zA-Z0-9]"
reg.replacementPattern = ""
reg.Options.ReplaceAllMatches = True // replace every match, not just the first one

Dim source as string = "wiou fhbv6H98E61-5L!@&L$!@O ^?W""><{O)_R)&&L!_""D"
Dim result as string = reg.replace( source )
msgBox source + endOfLine + result[/code]

Huh! That simple? I think I just won the moron of the week award!

Thanks Syed.

Sam, a few things about that pattern.

  • It will remove spaces. I assume that’s what you want?
  • The RegEx class is case INsensitive unless you change the option, so while a-zA-Z is not wrong, it is redundant.
  • That pattern will remove accented characters like ü or é. If that’s not what you want, try this instead:
[^\pL\pN]

That uses the Unicode properties of each character to determine if it’s a letter or number, and it will only work in newer versions of Xojo or the latest MBS plugins.
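
For anyone following along, here is a minimal sketch of how that pattern might be dropped into the code from earlier in the thread. It assumes a Xojo version (or MBS plugin) whose RegEx understands \pL and \pN; the sample text and variable names are only for illustration.

[code]// Sketch only: strip everything that is not a Unicode letter or number.
// Assumes the RegEx engine in use supports the \pL and \pN properties.
Dim rx As New RegEx
rx.SearchPattern = "[^\pL\pN]"
rx.ReplacementPattern = ""
rx.Options.ReplaceAllMatches = True // without this only the first match is replaced

Dim source As String = "Crème brûlée no. 2!" // illustrative input
Dim result As String = rx.Replace( source )
MsgBox source + EndOfLine + result[/code]

The spaces and punctuation should go, while accented letters such as è and û should stay because they count as letters.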

Thanks Kem,
I basically want to strip it down to either numbers or letters. My original plan was to remove all non-English characters as well, but now that you mention it, I may have to rethink that.

It’s for renaming files so that they’re safe to post online.

Perhaps ConvertEncoding to ASCII first? That will replace things like ù with just u. Then you should just be able to ReplaceAll( s, "?", "-" ) (or something).

Or run it through your RegEx after ConvertEncoding if you want to be sure to remove Windows-illegal characters.
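
A rough sketch of that two-step idea, in case it helps. The input string and the "_" replacement character are assumptions for illustration, and exactly what ConvertEncoding does with characters that have no ASCII equivalent may vary by platform, so check the intermediate result on your own system.

[code]// Sketch only: flatten to ASCII first, then replace whatever is left over.
Dim source As String = "mon café préféré" // illustrative input, not a real file name
Dim ascii As String = ConvertEncoding( source, Encodings.ASCII )

Dim rx As New RegEx
rx.SearchPattern = "[^a-z0-9]" // RegEx is case-insensitive by default, so this also covers A-Z
rx.ReplacementPattern = "_" // assumed replacement character
rx.Options.ReplaceAllMatches = True

Dim safeName As String = rx.Replace( ascii )
MsgBox source + EndOfLine + ascii + EndOfLine + safeName[/code]

Note that this pattern would also replace the dot before a file extension, so for real file names you would probably want to run it on the base name only.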

Thanks so much Kem for the extra advice… Pretty darn cool what can be done now!

“mj ? ? Mac aplicacin” = “mojMacAplicacion”

Which is what I originally wanted, but with the preserving accents option, it comes out “mj?MacAplicacin”!

Can I use subsequent RegEx to then remove sequential occurrences of a string? For instance, if the replacement character for the above string is “_”, it then reads “moj_____Mac_Aplicacion”, which to me looks wrong, so I’d like to replace the “_____” with just a single underscore.

I’ve got it… After much failure, removing multiple repeating characters turns out to be incredibly simple.

searchPattern = "_+"
replacementPattern = "_"

Hooray!

Is there a way to “emulate” the Unicode blocks?
Like \p{InLatin-1_Supplement}

You can squeeze any repeating character with:

rx.SearchPattern = "(.)\g1+"
rx.ReplacementPattern = "$1"

If you only want to squeeze symbols (and you’ve already converted to ASCII), you can use a variation of your original pattern:

rx.SearchPattern = "([^a-z0-9])\g1+"
rx.ReplacementPattern = "$1"
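
Put together as a complete snippet, the symbol-squeezing variant might look something like this. It is only a sketch: the input is the underscore example from earlier in the thread, and it assumes the RegEx engine accepts the \g1 back-reference.

[code]// Sketch only: collapse a run of the same symbol into a single occurrence.
Dim rx As New RegEx
rx.SearchPattern = "([^a-z0-9])\g1+" // a non-alphanumeric followed by one or more copies of itself
rx.ReplacementPattern = "$1" // keep just the first copy
rx.Options.ReplaceAllMatches = True // squeeze every run, not only the first

Dim source As String = "moj_____Mac_Aplicacion"
Dim result As String = rx.Replace( source )
MsgBox source + EndOfLine + result[/code]

The aim is for the run of five underscores to come back as one, with the single underscore later in the string left alone.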

[quote=103406:@Antonio Rinaldi]Is there a way to “emulate” the Unicode blocks?
Like \p{InLatin-1_Supplement}[/quote]

Since the blocks are contiguous, you can look for the range of code points (listed here, among other places).

In the example you mentioned, that’s range U+0080 through U+00FF. The pattern “\x{NNNN}” will let you specify a code point, and you can put the range within a character class, so:

rx.SearchPattern = "[\x{80}-\x{FF}]"
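
As a hedged usage sketch, this is how that range might be used to strip Latin-1 Supplement characters from a string. The sample word and the choice to remove the matches rather than replace them are assumptions for illustration.

[code]// Sketch only: remove characters in the Latin-1 Supplement block (U+0080 through U+00FF).
Dim rx As New RegEx
rx.SearchPattern = "[\x{80}-\x{FF}]"
rx.ReplacementPattern = ""
rx.Options.ReplaceAllMatches = True

Dim source As String = "aplicación" // ó is U+00F3, which falls inside the block
Dim result As String = rx.Replace( source )
MsgBox source + EndOfLine + result[/code]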

Odd how the link came in, and that it won’t let me edit it now. It’s this:

http://en.wikipedia.org/wiki/Unicode_block

You can get even more information here:

http://www.regular-expressions.info/unicode.html

Get in line Sam. No queue jumping allowed.