RegEX Befuddlement...

Jon_Ogden · June 13, 2015, 1:10am

Hey guys, I have a RegEx problem I’m trying to solve and don’t understand what is happening. So here is my RegEx:

(?mi-Us)(\W*)([\e]\[([0-9])*([ABCDKJm])\s?)+

I’m trying to catch VT-100 escape characters when connecting to the console of a managed switch either via RS-232 or Telnet. Now, the RegEx works just fine in Kem’s RegExRX. Catches the escape characters and groups everything just fine.

The Xojo Regex function does NOT work. It won’t catch or identify the escape characters. However, the MonkeyBreak RegEx classes work and catch them just fine. Same exact data. Same exact search pattern. RegExRX and MBS work, Xojo does not.

I don’t know why. The characters before the escape character 0x1B, are very high ASCII characters - with ASCII values of like 253, 255, etc. One thing I do notice when copying these characters into RegExRx is that RegExRx identifies their ASCII value differently than Xojo does. Still, it shouldn’t matter. I originally was using a pattern like:

(?mi-Us)([\e]\[([0-9])*([ABCDKJm])\s?)+

That pattern should still work and capture things. The addition of the (\W) grouping at the beginning was done to see if I could get Xojo to capture those high ASCII characters along with the escape characters.

So have I found some sort of a bug in the Xojo RegEx class? Maybe Kem can answer this but I thought his RegExRx uses the Xojo class.

I’d rather use the Xojo class as I find it a little simpler than MBS, and MBS seems to be inserting some spaces when I do a replace. I don’t want spaces when I replace the identified escape codes.

Ideas?

Kem_Tekinay · June 13, 2015, 3:17am

RegExRX uses RegExMBS.

I can use \e to match ChrB( 27 ) in Xojo so it’s hard to see where the pattern fails without the data you’re matching against.

BTW, when you’re matching a single character, you don’t need to enclose it in a character class. “\e” works the same as “[\e]”.
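
For illustration only, here is a minimal Python sketch (not Xojo code) of that point. It assumes a hypothetical sample modeled on the console output discussed in this thread, and it writes ESC as `\x1b` because Python's `re` module has no `\e` escape. The capture groups from the original pattern are dropped for brevity.

```python
import re

# Hypothetical sample: three high bytes, then ESC [ K, then the prompt.
data = b"\xfa\xfa\xfa\x1b[Kswitchc824c0#"

# The same pattern with and without a character class around ESC.
with_class = re.compile(rb"(?:[\x1b]\[[0-9]*[ABCDKJm]\s?)+")
without_class = re.compile(rb"(?:\x1b\[[0-9]*[ABCDKJm]\s?)+")

m1 = with_class.search(data)
m2 = without_class.search(data)

# Both forms find the same escape sequence.
same_span = m1.span() == m2.span()

# Replacing the match with nothing strips the escape code
# without inserting any spaces.
cleaned = without_class.sub(b"", data)
print(same_span, cleaned)
```

Because the pattern and the data are both byte strings here, the high bytes in front of the escape never have to be interpreted as text.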

Christian_Schmitz · June 13, 2015, 10:27am

MBS Class is a wrapper for PCRE and has more options. It can be harder but you should benefit from better speed and more features.

Suggestions for improvements are welcome.

Jon_Ogden · June 13, 2015, 2:38pm

[quote=193975:@Kem Tekinay]RegExRX uses RegExMBS.

I can use \e to match ChrB( 27 ) in Xojo so it’s hard to see where the pattern fails without the data you’re matching against.

BTW, when you’re matching a single character, you don’t need to enclose it in a character class. “\e” works the same as “[\e]”.[/quote]

I’ll have to fire up my code and connect to the switch in question and capture the output and post it here. I think it’s the control characters before \e that are causing the problem…

Jon_Ogden · June 13, 2015, 2:43pm

[quote=194013:@Christian Schmitz]MBS Class is a wrapper for PCRE and has more options. It can be harder but you should benefit from better speed and more features.

Suggestions for improvements are welcome.[/quote]

Actually, Christian, Kem and I have had discussions on this offline. I did a side-by-side test of MBS and Xojo RegEx speeds. In cases where there was a large amount of text to search through such as a large text document, MBS did much better. However, when dealing with small amounts of text data, such as a few lines output from a Telnet connection to a device, the Xojo class ran faster. I’ve generally stayed with the Xojo class because of its simplicity and the fact that I’m not generally scanning large amounts of text looking for patterns…

MBS is definitely a bit more confusing, and I’m having some trouble right now where I’m trying to search through data, find a match and then continue searching from where I left off. I’m using the offsets to specify where to search but in one instance, it’s not getting the starting point right and I’m ending up in an infinite loop. I should probably send you an e-mail about it directly. Just have not had the time.
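
As a generic illustration of the "continue from where I left off" loop, here is a minimal Python sketch. It is not the MBS API; the function name `find_all_from_offsets` and the sample data are hypothetical. The idea is to resume at the end of the previous match and to step forward one extra byte when a match is empty, so the offset always advances.

```python
import re

def find_all_from_offsets(pattern, data):
    """Collect (start, end) spans by re-searching from the end of each match."""
    spans = []
    offset = 0
    while offset <= len(data):
        m = pattern.search(data, offset)
        if m is None:
            break
        spans.append(m.span())
        # Resume after the match. If the match was empty, move one byte
        # further so the same position is never searched twice.
        offset = m.end() if m.end() > m.start() else m.end() + 1
    return spans

# Hypothetical console output containing three escape sequences.
escape = re.compile(rb"(?:\x1b\[[0-9]*[ABCDKJm]\s?)+")
data = b"\x1b[2Jlogin: \x1b[Kswitch#\x1b[0m"
spans = find_all_from_offsets(escape, data)
print(spans)
```

A pattern that can match the empty string is one common way for such a loop to stall, which is why the guard is there; whether that was the cause in this case is not known.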

I’m trying to get updates out for 3 different packages right now!

Jon_Ogden · June 13, 2015, 8:01pm

OK. Here’s an example of the text that I’m trying to parse:

???e[Kswitchc824c0#

Now, Xojo tells me that the corresponding Hex bits are:

FAFA FA1B 5848 7377 6974 6368 6338 3234 6330 23

If I copy the text into RegExRX, it tells me the hex bits are:

FFFD FFFD FFFD 1B 58 48

Now, it shouldn’t make a difference as there’s still hex 1B in there that is the escape character and the pattern should pick it up.

So I’m open to suggestions…

Or like I said, maybe I’ve found a bug in Xojo’s RegEx class.

Jon_Ogden · June 13, 2015, 9:31pm

I believe I have figured out how to properly search using the MBS Regex class. So I think my code is going to work OK.

Kem_Tekinay · June 13, 2015, 10:17pm

I have a theory that the native RegEx is UTF-8-centric and cannot convert your bytes. As soon as I assign an arbitrary encoding, in this case ISOLatin1, it works just fine.
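
The same effect can be reproduced outside Xojo. The following Python sketch is only an analogy and makes no claim about how the native RegEx class is implemented. It assumes hypothetical sample bytes modeled on the output in this thread, and shows that the high bytes are not valid UTF-8, while a single-byte encoding such as ISO Latin-1 maps every byte to a character, so the pattern can match.

```python
import re

# Hypothetical sample: three high bytes, then ESC [ K, then the prompt.
raw = b"\xfa\xfa\xfa\x1b[Kswitchc824c0#"

# 0xFA is not a valid UTF-8 start byte, so a strict decode fails.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# A lenient decode substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# Latin-1 maps each byte 0x00-0xFF to the code point of the same value,
# so nothing is lost and the ESC character stays where it was.
latin1 = raw.decode("latin-1")

pattern = re.compile(r"(?:\x1b\[[0-9]*[ABCDKJm]\s?)+")
match = pattern.search(latin1)
print(valid_utf8, [hex(ord(c)) for c in replaced[:3]], match.span())
```

The U+FFFD substitution is a plausible explanation for a tool displaying FFFD where the raw bytes were FA, though that is an inference, not something confirmed in this thread.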

Jon_Ogden · June 13, 2015, 10:19pm

Ah. I was experimenting with viewing different encodings in the debugger and nothing in terms of the bytes changed - just the representation. Since RegEx should be looking at the bytes, making it UTF-8 centric seems kind of silly.

Jon_Ogden · June 13, 2015, 10:23pm

By the way, I did try defining my encoding as UTF8 and also US ASCII and those made no difference. But maybe the encoding from my device is coming in as ISOLatin1…

Kem_Tekinay · June 13, 2015, 10:25pm

It depends. Suppose you are trying to match a string that is ISOLatin1-encoded and you set your search pattern as UTF-8 (as you probably will). If RegEx didn’t try to convert the encoding, you wouldn’t get a match on any high-ASCII characters. The native RegEx tries to protect you from that, it seems.

Kem_Tekinay · June 13, 2015, 10:28pm

No, it’s not really encoded at all, it’s just bytes, not meant to represent text. That is not to say that parts of it aren’t meant to represent text, just not the whole thing. You would be better off using NthFieldB to extract the section after the escape, define that encoding as ASCII (or even UTF-8, since it will likely contain no high-ASCII characters anyway), then match against that.
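
A rough Python equivalent of that approach, as a sketch only: `bytes.partition` stands in for NthFieldB, and the helper name `text_after_escape` and the sample bytes are hypothetical. The section after the first escape byte is split off as raw bytes, only that section is treated as ASCII text, and the pattern is applied to it.

```python
import re

ESC = b"\x1b"

# Matches the remainder of one escape sequence at the start of the text,
# using the same final letters as the pattern in this thread.
CSI_TAIL = re.compile(r"^\[[0-9]*[ABCDKJm]\s?")

def text_after_escape(raw):
    """Return the readable text following the first ESC byte, or None."""
    head, sep, tail = raw.partition(ESC)
    if not sep:
        return None  # no escape byte in the data
    # Only the tail is treated as text; the leading bytes stay bytes.
    text = tail.decode("ascii", errors="replace")
    return CSI_TAIL.sub("", text)

prompt = text_after_escape(b"\xfa\xfa\xfa\x1b[Kswitchc824c0#")
print(prompt)
```

This keeps the undecodable bytes out of the regular expression entirely, at the cost of handling only the first escape in the buffer.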

Jon_Ogden · June 13, 2015, 10:30pm

That’s similar to what I was doing before using the RegEx. It works just fine with RegExMBS and I figured out how to do the sequential searching. So for now at least the problem is solved.

But if the RegEx is properly searching for bytes, then it should match 0x1B because regardless of the encoding, 0x1B is still 0x1B and that is what the RegEx is looking for. So to me, that is a bug then…

Kem_Tekinay · June 13, 2015, 10:38pm

Could be. You should file it, but keep in mind that regular expressions are meant to match against text, not bytes per se. You aren’t feeding it text, and your use isn’t the common one.

BTW, as a rule of thumb, if the string you’re looking at isn’t meant to be human-readable, it should be considered a series of bytes, not text, so there is no TextEncoding involved.

Jon_Ogden · June 13, 2015, 10:51pm

I’d agree. However, in my case, there’s a mix of both human readable and control characters. I have no idea what the high ASCII value characters are for as they don’t serve any purpose I can figure out. I guess I’d have to ask Cisco!

It’s interesting how you say that RegEx is meant to match against text because the MBS classes, based on PCRE, have you do everything in terms of the bytes. That’s why I figured it was byte based…