Tip: Skipping a block in a regular expression

Sometimes you need a regular expression to match something as long as it does not occur within a certain block. For example, in a CSV file, you might want to identify commas that don’t appear within quotes. In code, you might want to find a keyword that’s not within a quote or comment.

PCRE, the engine behind Xojo’s RegEx, gives you a way to do this using the (*SKIP) and (*FAIL) verbs. Let’s use a CSV as an example where, to keep it simple, embedded quotes are escaped by doubling them up.

"[^"]*(?:"|\z)(*SKIP)(*FAIL)|,

So what does this do? It starts by matching a quoted string: the opening quote, followed by any number of characters that are not a quote, then the closing quote (or the end of the document). Once it’s matched that, the engine passes SKIP, which means, “if you have to backtrack past this point, give up on this attempt and start the next one here”. In other words, we’ve told the engine to skip the quoted string entirely.

The next verb, FAIL, does exactly what you’d think: it tells the engine that the match has failed and it should now backtrack and try again. But because of the preceding SKIP verb, it cannot backtrack into the quoted string, so it starts matching again just past it. Eventually it will match the “,” given in the other branch of the alternation and succeed.
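If you want to experiment with the idea outside of PCRE, here is a minimal Python sketch. Python’s standard re module does not support the SKIP and FAIL verbs, so this swaps in a different technique that gets the same result: the quoted string is still matched first, but the comma is wrapped in a capture group and any match where that group is empty is thrown away. The function name and the sample row are made up for illustration.

```python
import re

# Emulates  "[^"]*(?:"|\z)(*SKIP)(*FAIL)|,  without backtracking verbs:
# the first branch consumes a quoted string (or an unterminated one that
# runs to the end of the text), the second branch captures a comma.
# \Z is Python's "very end of string" anchor, like PCRE's \z.
PATTERN = re.compile(r'"[^"]*(?:"|\Z)|(,)')

def unquoted_comma_positions(line):
    """Return the offsets of commas that are not inside double quotes."""
    return [
        m.start(1)
        for m in PATTERN.finditer(line)
        if m.group(1) is not None  # None means the quoted branch matched
    ]

row = 'one,"two, with comma","say ""hi"", ok",four'
positions = unquoted_comma_positions(row)
print(positions)  # [3, 21, 38]
```

Note that a doubled quote needs no special handling here: "say ""hi"", ok" is simply consumed as three quoted strings in a row, so the comma inside it is never seen by the second branch.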

Let’s try something harder, like matching a “SELECT” in SQL that does not occur within a comment or quote. Because this is more complex, I will use free-space mode, (?x), to make it more readable.

(?x)               # FREE-SPACE MODE

(?:                # non-capturing group
  " [^"]* (?:"|\z) # identifier
  |                # or
  ' [^']* (?:'|\z) # string
  |                # or
  --.*             # single-line comment
  |                # or
  /\* (?s: (?!\*/) . )* (?:\*/|\z) # multi-line comment, (?s:) lets the dot cross line breaks
)                  # end non-capturing group
(*SKIP) (*FAIL)    # skip the stuff we don't need

|                  # OR

\bSELECT\b         # here's the part we really want
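Here is the same pattern as a rough Python sketch, again using a capture group in place of the SKIP and FAIL verbs because Python’s standard re module doesn’t have them. The names and the sample SQL are invented for the example, and the scoped (?s:...) in the multi-line comment branch is my assumption that the dot should be allowed to cross line breaks there and nowhere else.

```python
import re

# Branch 1 consumes the things to skip, branch 2 captures the keyword.
SQL_PATTERN = re.compile(
    r'''
    (?:
        " [^"]* (?:"|\Z)                    # identifier
      | ' [^']* (?:'|\Z)                    # string
      | --.*                                # single-line comment
      | /\* (?s: (?!\*/) . )* (?:\*/|\Z)    # multi-line comment
    )
    |
    (\bSELECT\b)                            # the part we really want
    ''',
    re.VERBOSE,
)

def select_positions(sql):
    """Return offsets of SELECT keywords outside quotes and comments."""
    return [
        m.start(1)
        for m in SQL_PATTERN.finditer(sql)
        if m.group(1) is not None  # None means a skipped block matched
    ]

sample = (
    "SELECT name -- not this SELECT\n"
    "FROM t /* nor this\n"
    "SELECT */ WHERE note = 'SELECT me'\n"
    "UNION SELECT \"SELECT\" FROM u"
)
print(select_positions(sample))  # [0, 91]
```

Of the six occurrences of SELECT in the sample, only the first one and the one after UNION are reported. The others sit inside a comment, a string, or a quoted identifier and are consumed by the first branch.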

In short, when you want to match something that does not occur in something else, match the “something else” first and tell the engine to skip it.

I just added this as a RegExRX sample, btw. If you have RegExRX, just download the samples again through the Help menu.