Hey guys, I have a RegEx problem I’m trying to solve and don’t understand what is happening. So here is my RegEx:
I’m trying to catch VT-100 escape characters when connecting to the console of a managed switch either via RS-232 or Telnet. Now, the RegEx works just fine in Kem’s RegExRX. Catches the escape characters and groups everything just fine.
The Xojo RegEx function does NOT work. It won’t catch or identify the escape characters. However, the Monkeybread (MBS) RegEx classes work and catch them just fine. Same exact data. Same exact search pattern. RegExRX and MBS work, Xojo does not.
I don’t know why. The characters before the escape character, 0x1B, are very high ASCII characters, with values like 253, 255, etc. One thing I do notice when copying these characters into RegExRX is that RegExRX identifies their ASCII value differently than Xojo does. Still, it shouldn’t matter. I originally was using a pattern like:
Previously and that should still work and capture things. The addition of the (\W) grouping at the beginning was done to see if I could get Xojo to capture those along with the escape characters.
So have I found some sort of a bug in the Xojo RegEx class? Maybe Kem can answer this, but I thought his RegExRX uses the Xojo class.
I’d rather use the Xojo class, as I find it a little simpler than MBS. I’m also finding that MBS seems to be inserting some spaces when I do a replace, and I don’t want spaces when I replace the identified escape codes.
[quote=194013:@Christian Schmitz]MBS Class is a wrapper for PCRE and has more options. It can be harder but you should benefit from better speed and more features.
Suggestions for improvements are welcome.[/quote]
Actually, Christian, Kem and I have had discussions on this offline. I did a side-by-side test of MBS and Xojo RegEx speeds. In cases where there was a large amount of text to search through, such as a large text document, MBS did much better. However, when dealing with small amounts of text data, such as a few lines output from a Telnet connection to a device, the Xojo class ran faster. I’ve generally stayed with the Xojo class because of its simplicity and the fact that I’m not generally scanning large amounts of text looking for patterns…
It’s definitely a bit more confusing and I’m having some trouble right now where I’m trying to search through data, find a match and then continue searching from where I left off. I’m using the offsets to specify where to search but in one instance, it’s not getting the starting point right and I’m ending up in an infinite loop. I should probably send you an e-mail about it directly. Just have not had the time.
I’m trying to get updates out for 3 different packages right now!
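For anyone hitting the same infinite loop with offset-based searching, here is a minimal sketch of the general idea. It is written in Python rather than Xojo or MBS, purely as an illustration, and the escape-sequence pattern and the helper name `find_all_from_offsets` are assumptions of mine, not the pattern from the original post. The key point is that the next search must start at the end of the previous match, and must still move forward when a match is zero-length.

```python
import re

def find_all_from_offsets(pattern: bytes, data: bytes):
    """Collect (start, end) spans by repeatedly searching from an offset."""
    regex = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(data):
        m = regex.search(data, pos)
        if m is None:
            break
        spans.append((m.start(), m.end()))
        # Resume at the END of the match, not its start. If the match was
        # zero-length, step one byte forward so the loop cannot stall.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return spans

# Hypothetical sample: two high bytes, then two VT-100 style sequences.
sample = b"\xff\xfd\x1b[2J\x1b[1;1HSwitch>"
spans = find_all_from_offsets(rb"\x1b\[[0-9;]*[A-Za-z]", sample)
```

If the loop restarts from the match start, or from an offset counted in characters when the engine expects bytes, the same match can be found over and over, which is one plausible way to end up stuck.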
Ah. I was experimenting with viewing different encodings in the debugger and nothing in terms of the bytes changed - just the representation. Since RegEx should be looking at the bytes, making it UTF-8 centric seems kind of silly.
It depends. Suppose you are trying to match a string that is ISOLatin1-encoded and you set your search pattern as UTF-8 (as you probably will). If RegEx didn’t try to convert the encoding, you wouldn’t get a match of any high-ASCII characters. The native RegEx tries to protect you from that, it seems.
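To make the encoding mismatch concrete, here is a small sketch in Python (not Xojo; it only illustrates the byte-level difference and says nothing about how the native RegEx is implemented). The same character, U+00FD, is one byte in ISO Latin-1 and two bytes in UTF-8, so a byte-for-byte comparison fails until one side is converted.

```python
text = "\u00fd"  # U+00FD, code point 253

latin1_bytes = text.encode("latin-1")  # single byte 0xFD
utf8_bytes = text.encode("utf-8")      # two bytes 0xC3 0xBD

# Searching Latin-1 data for the UTF-8 form of the character finds nothing.
raw_match = utf8_bytes in latin1_bytes

# Converting the subject to UTF-8 first makes the same search succeed.
converted = latin1_bytes.decode("latin-1").encode("utf-8")
converted_match = utf8_bytes in converted
```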
No, it’s not really encoded at all, it’s just bytes, not meant to represent text. That is not to say that parts of it aren’t meant to represent text, just not the whole thing. You would be better off using NthFieldB to extract the section after the escape, define that encoding as ASCII (or even UTF-8, since it will likely contain no high-ASCII characters anyway), then match against that.
That’s similar to what I was doing before using the RegEx. It works just fine with RegExMBS and I figured out how to do the sequential searching. So for now at least the problem is solved.
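A rough sketch of that approach, in Python rather than Xojo: split the raw bytes on the escape byte, take the section after it, treat only that section as ASCII text, and match against it. The helper name `text_after_first_escape`, the sample data, and the pattern are all hypothetical; the byte-wise split stands in for what NthFieldB would do.

```python
import re

ESC = b"\x1b"

def text_after_first_escape(data: bytes) -> str:
    """Return the section between the first and second ESC as ASCII text."""
    parts = data.split(ESC)  # byte-wise split, loosely like NthFieldB
    if len(parts) < 2:
        return ""            # no escape byte present
    # Only this section is treated as text; stray high bytes are replaced.
    return parts[1].decode("ascii", errors="replace")

# Hypothetical sample: two high bytes, one escape sequence, then a prompt.
sample = b"\xff\xfd\x1b[2JSwitch>"
section = text_after_first_escape(sample)

# Match the parameters and final letter of the sequence in the text section.
m = re.match(r"\[([0-9;]*)([A-Za-z])", section)
```

The high bytes before the escape never reach the text-based match, so their encoding stops mattering.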
But if the RegEx is properly searching for bytes, then it should match 0x1B because regardless of the encoding, 0x1B is still 0x1B and that is what the RegEx is looking for. So to me, that is a bug then…
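Here is a small illustration of the difference between the two views, in Python rather than Xojo. A byte-oriented search finds 0x1B wherever it sits, while the same buffer is not valid UTF-8 because of the leading high bytes. Whether that is what trips up the Xojo class is an assumption on my part, not something confirmed in this thread; the sample data is also made up.

```python
import re

# Hypothetical console output: two high bytes followed by an escape sequence.
data = b"\xff\xfd\x1b[2J"

# Byte-oriented search: the pattern and subject are both raw bytes.
byte_match = re.search(rb"\x1b", data)

# Text-oriented view: the buffer has to be valid in the assumed encoding.
try:
    data.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

An engine that insists on valid text before searching would reject or alter this buffer before it ever looked for 0x1B, which would be consistent with a byte-based engine matching and a text-based one not.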
I’d agree. However, in my case, there’s a mix of both human readable and control characters. I have no idea what the high ASCII value characters are for as they don’t serve any purpose I can figure out. I guess I’d have to ask Cisco!
It’s interesting how you say that RegEx is meant to match against text because the MBS classes, based on PCRE, have you do everything in terms of the bytes. That’s why I figured it was byte based…