Tricky regex pattern help needed

I’m trying to match closing HTML tags with regular expressions. In this case I need to match a closing tag only if it is not preceded by a corresponding opening tag on the same line.

This is for the HTML syntax definition for my open source code editor.

Given this faux HTML:

<a class="logo" href="a.com">
abc
</a>

<script type="something">
{
"description": "Things I find interesting"
}
</script>

<script type="application/ld+json">
{
"@context": "https://schema.org",
}</script>

<script></script>

All the closing tags should match except the final </script> one, since it is preceded by its matching opening <script>.

I have distilled it down to this regex pattern (which doesn’t work):

(?=(<\/((?:\w)+)>))(?!<(\2)>)(<\/\w+>)

I believe my regex pattern asserts that there is a closing tag, then checks that there is not a matching opening tag, and then consumes the closing tag. The trouble is that it matches all the closing tags.
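
A quick check hints at why. It is sketched here in Python's re module as a stand-in for the editor's engine, which is an assumption, since engines can differ. Both lookaheads are evaluated at the same position, the start of the closing tag, so the negative lookahead is looking forward at "</script>" rather than backward at the opening tag:

```python
import re

# The pattern from the post, unchanged. Python's re accepts this syntax.
PATTERN = re.compile(r"(?=(<\/((?:\w)+)>))(?!<(\2)>)(<\/\w+>)")

line = "<script></script>"
m = PATTERN.search(line)

# Both lookaheads are tested where "</script>" begins. The text at that
# position starts with "</", so "<script>" can never be found there and
# the negative lookahead always succeeds.
print(m.group(0), m.start())
```

So the pattern still matches the closing tag on a line like <script></script>, which is the behaviour described above.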

Try this:

(?U)<(\w+\b).*\n[\s\S]*\K(</\g1>)

Try this:

(?msi-U)<([^ >]+)([^>]+)>([^<]+)</\1>

This is close but not quite right:

I tried inserting \/ before (\w+\b), thinking that would only capture the closing tag (since I don’t want to capture the opening tag), but that seems to break it capturing </a>:

@Greg_O, your pattern seems to match a bit too much:

It depends on what you need I suppose. I was aiming for “give Garry all the parts separately so he can deal with it as he wishes” so the tags, attributes and contents are all in individual matches.

You should note however that mine will not handle nested tags.

Your engine doesn’t seem to handle the \K token. If you try that in Xojo or RegExRX, it will only match the closing tag as you requested.

But you can remove \K and the closing tag will be in the second captured group.

Here is the pattern with comments, if that helps. (This can be used as-is with any engine that accepts the (?x) token):

(?x) # Free style

(?U) # NOT greedy

<(\w+\b) # Opening tag captured in g1

.* # Anything until the end of the line
\n # Make sure there is a linefeed
[\s\S]* # Anything else, until...

\K # (restart the match here, ignoring what came before)

(</\g1>) # The closing tag

Thank you so much for this help @Kem_Tekinay - I didn’t even know about the \K modifier.

I think what is happening here is that the syntax highlighting engine that I have ported from CEF is analysing one line at a time with a regex looking for a “block end”. This means it can’t scan to find a matching opening tag (which I think is what your pattern does).

With that constraint and given this HTML:

<a class="logo" href="a.com">
abc
</a>

<script type="something">
{
"description": "Things I find interesting"
}
</script>

<script type="application/ld+json">
{
"@context": "https://schema.org",
}</script>

<script></script>

The block end lines that will be analysed will be:

</a>

</script>

}</script>

<script></script>

I need a pattern that would match the first three lines but not the last line (<script></script>).

I appreciate that regex is not the best tool to parse HTML (since HTML isn’t a regular grammar) but I feel like we are tantalisingly close…

^((?!<\w+>).)*</\w+>
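
To see how this pattern behaves on the block end lines, here is a minimal sketch using Python's re module. That choice is an assumption, since the thread targets Xojo's engine; the last test line is the attribute-bearing example that comes up later in the thread:

```python
import re

# Pattern from the post above, compiled with Python's re.
BLOCK_END = re.compile(r"^((?!<\w+>).)*</\w+>")

lines = [
    "</a>",
    "</script>",
    "}</script>",
    "<script></script>",
    '<script src="js/scripts.js"></script>',
]

# True means the line would be treated as a block end.
results = {line: bool(BLOCK_END.search(line)) for line in lines}
for line, matched in results.items():
    print(f"{matched!s:5}  {line}")
```

The first three lines match and <script></script> does not, as wanted. The line with attributes still matches, because <\w+> only recognises an opening tag that has no attributes.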

Different engines have different capabilities, and unfortunately, some are quite limited.

Will your engine accept (*SKIP)(*FAIL) tokens? I’m guessing not, but it would be good to confirm.

I wonder if adding a (?-m) switch to my pattern will make a difference? If the code only feeds one line at a time to the regular expression, it won’t, but again, let’s confirm.

Perhaps you can filter the matches in code? For example:

<(\w+).*</\g1|</\w+>

If the first two characters are something other than “</”, you can ignore it.
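
A minimal sketch of that filtering idea, written in Python rather than Xojo (so \1 stands in for \g1, and the helper name closing_tag_matches is hypothetical):

```python
import re

# Left branch: an opening tag followed by its own closing tag on the line.
# Right branch: any closing tag.
TAG = re.compile(r"<(\w+).*</\1|</\w+>")

def closing_tag_matches(line: str) -> list[str]:
    """Return only the matches that start with '</' (hypothetical helper)."""
    return [
        m.group(0)
        for m in TAG.finditer(line)
        if m.group(0).startswith("</")
    ]

for line in ["</a>", "}</script>", "<script></script>"]:
    print(line, "->", closing_tag_matches(line))
```

When an opening tag and its closing tag share a line, the left branch consumes both, so the only match starts with "<s" rather than "</" and is filtered out.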

I’m just using the built-in Xojo engine…

I see, it’s just that the code is feeding one line at a time. In that case…

(?U)<(\w+).*</\g1(*SKIP)(*FAIL)|</\w+>

Edit: Added Ungreedy switch.

Almost but that will still mark a tag as closing if there is a matching opening tag on the same line:

<script src="js/scripts.js"></script>

(polite cough) Time to write a parser! :grin:

YES!! You beauty Kem.

I cannot thank you enough.

I need to read about these *SKIP and *FAIL tokens though. I thought I had a decent grip on Regex but evidently not :slight_smile:

The engine keeps an internal pointer that it will backtrack to until it determines that it cannot make a match. It will advance that pointer to the next point where it might be able to make a match, then backtrack to that point as needed, and repeat until it makes a match or runs out of text.

The (*SKIP) token says “advance the pointer to here, no backtracking.”

(*FAIL) means, “I don’t care what you think, the match failed.”

Together, it’s a way to perform a sort of reverse match, i.e., match what you don’t want, then tell the engine to skip past it.

Here is a simple use case:

"[^"]*"(*SKIP)(*FAIL)|y

This will match the “y” in xyz, but not in "xyz" because anything within quotes is skipped.
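
For engines without these tokens, a similar effect can be approximated in code. The sketch below uses Python's standard re module, which does not support (*SKIP)(*FAIL), and swaps in a different technique: the unwanted text is matched by one branch and discarded, and the wanted text is captured in a group by the other. The function name find_unquoted_y is hypothetical:

```python
import re

# Left branch consumes quoted strings so the scan moves past them.
# Right branch captures the "y" we actually want in group 1.
SKIP_QUOTED = re.compile(r'"[^"]*"|(y)')

def find_unquoted_y(text: str) -> list[int]:
    """Return the offsets of every 'y' that is outside double quotes."""
    return [
        m.start(1)
        for m in SKIP_QUOTED.finditer(text)
        if m.group(1) is not None
    ]

print(find_unquoted_y('xyz'))
print(find_unquoted_y('"xyz"'))
print(find_unquoted_y('"xyz" y'))
```

The difference from the real tokens is that the quoted text still produces a match here, so the caller has to check the capture group and ignore the matches where it is empty.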
