Regex Capture Issue

I need to capture instances of two or more consecutive "X"s, two or more consecutive "Y"s, and two or more consecutive "Z"s in a string. A typical input string might be:

BXXX-CYY-DZZZZ

They will always be in X-Y-Z order. The Ys and Zs are optional, but there will never be a ZZ without a YY.

This search pattern:

(X{2,})?(Y{2,})?(Z{2,})?

works fine when I paste it into https://regex101.com/, capturing all three groups.

But in Xojo, the match has only two subexpression strings, both of which are Xs - it seems to bail out after the first match, despite Greedy being True.

Maybe the "?"s that I’m using to denote optionality are toggling Greedy? How do I avoid this?

SubExpressionString(0) is always the whole string match, as noted in the documentation.

Your expression is finding 3 different results, not one result with three parts.

Yes, I know.

Xojo is giving me SubExpressionString(0), which is the whole match “XXX”, and only one of the three capture groups I’ve asked for, in SubExpressionString(1).

I expect it to give me four SubExpressionStrings: 0, the whole match, and one for each of the three capture groups, i.e. “XXX”, “YY”, and “ZZZZ”, which in fact is what regex101 gives.

No, you are reading regex101 wrong. Disable /g and you’ll see what your expression actually means.

Edit: Sometimes if you provide your sample data set (not just one example) and what you’re expecting, someone will volunteer an expression that works.
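To illustrate the point about separate matches, here is a minimal sketch. It uses Python's `re` module rather than Xojo's RegEx class, so treat it only as a demonstration of how the pattern behaves; the helper name `nonempty_matches` is made up for this example. It runs the original pattern globally over the sample string and keeps the non-empty matches.

```python
import re

# The pattern from the original question: every group is optional.
PATTERN = re.compile(r"(X{2,})?(Y{2,})?(Z{2,})?")


def nonempty_matches(text):
    """Return the capture groups of each non-empty match (like a /g search)."""
    return [m.groups() for m in PATTERN.finditer(text) if m.group(0)]


if __name__ == "__main__":
    for groups in nonempty_matches("BXXX-CYY-DZZZZ"):
        print(groups)
    # ('XXX', None, None)
    # (None, 'YY', None)
    # (None, None, 'ZZZZ')
```

Each run is found by a separate match that fills only one of the three groups, because nothing in the pattern can consume the "-C" and "-D" between the runs.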

I can’t speak to how that site works, but your pattern won’t work as written because it doesn’t account for the characters between the Xs and Ys.

Try this instead:

(X{2,})(?:.*(Y{2,})(?:.*(Z{2,}))?)?

Closer, but not catching one of the Ys:

[image]

The test string in this example didn’t have any Zs; I need to add some to the test.

Claude thinks it should work:

"Here is an analysis of the given regex:

  • (X{2,}) - Matches 2 or more occurrences of the letter ‘X’. This is captured in a group.

  • (?:.*(Y{2,}))? - An optional non-capturing group matching:

    • . - Any single character
    • * - Zero or more instances of the preceding character
    • (Y{2,}) - 2 or more occurrences of the letter ‘Y’ captured in a group
  • (?:.*(Z{2,}))? - Another optional non-capturing group matching:

    • . - Any single character
    • * - Zero or more instances of the preceding character
    • (Z{2,}) - 2 or more occurrences of the letter ‘Z’ captured in a group

So in summary, this regex is trying to match:

  • 2 or more ‘X’ characters captured in group 1
  • Optionally followed by any character any number of times, followed by 2 or more ‘Y’ characters captured in group 2
  • Optionally followed by any character any number of times, followed by 2 or more ‘Z’ characters captured in group 3

It will match strings like:

  • XXX (only group 1)
  • XXX*YYYY (group 1 and 2)
  • XXX*YYYY*ZZZZZ (group 1, 2 and 3)

The ? after each optional non-capturing group allows those sections to not be present and still match."

I’m missing something about this, but rather than pull out (what’s left of) my hair, try this:

(?|(XX+)[^Y]*(YY+)[^Z]*(ZZ+)|(XX+)[^Y]*(YY+)|(XX+))

I think you should demand a larger data set, since the match failure was clearly on a string format that was not provided.

That works, thanks! Don’t know why your first one didn’t, as it sure looks like it should, but “works” is good enough :)

You’re welcome, but I see what went wrong. Using Greedy with .* caused it to overmatch, “eating” the first characters of a run and leaving only the last two for the capture group.
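The overmatch can be reproduced with a small sketch. This is Python's `re` module, not Xojo, and the helper name `capture` is hypothetical; the pattern is the first suggestion from this thread, unchanged.

```python
import re

# The first suggested pattern from this thread, with greedy ".*" as filler.
GREEDY = re.compile(r"(X{2,})(?:.*(Y{2,})(?:.*(Z{2,}))?)?")


def capture(text):
    """Return the three capture groups, or None if there is no match."""
    m = GREEDY.search(text)
    return m.groups() if m else None


if __name__ == "__main__":
    print(capture("BXXX-CYYY-DZZZZ"))  # ('XXX', 'YY', 'ZZ')
```

The greedy `.*` first runs to the end of the string and then backs off only as far as it must, so it stops as soon as two Ys (or two Zs) remain. The earlier characters of each run end up inside `.*` rather than in the group.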

This works too (formatted in free-spacing mode):

(?x)
(X{2,})
(?:[^Y]*
  (Y{2,})
  (?:[^Z]*
    (Z{2,})
  )?
)?
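For anyone who wants to try the free-spacing pattern outside Xojo, here is a minimal sketch in Python. It assumes `re.VERBOSE` as the equivalent of the `(?x)` prefix, and the names `FREE_SPACING` and `capture_runs` are invented for this example.

```python
import re

# Same pattern as above; re.VERBOSE plays the role of the (?x) prefix.
FREE_SPACING = re.compile(
    r"""
    (X{2,})          # group 1: two or more Xs
    (?:[^Y]*         # skip anything that is not a Y
      (Y{2,})        # group 2: two or more Ys
      (?:[^Z]*       # skip anything that is not a Z
        (Z{2,})      # group 3: two or more Zs
      )?
    )?
    """,
    re.VERBOSE,
)


def capture_runs(text):
    """Return (x_run, y_run, z_run); missing runs are None."""
    m = FREE_SPACING.search(text)
    return m.groups() if m else None


if __name__ == "__main__":
    print(capture_runs("BXXX-CYYY-DZZZZ"))  # ('XXX', 'YYY', 'ZZZZ')
    print(capture_runs("BXXX-CYY"))         # ('XXX', 'YY', None)
    print(capture_runs("BXXX"))             # ('XXX', None, None)
```

Because `[^Y]*` and `[^Z]*` cannot consume a Y or a Z, each capture group starts at the first character of its run and takes the whole run.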

I violated my own admonition against using .*. That’ll teach me.


More elegant :)