So, I’m trying to parse Instagram, and of course many users sprinkle their posts with emojis. I’m not sure how to deal with them, as some seem to act as double characters.
Here’s a simplified example:
The regex is (?<=(<meta\scontent=")).*(?="\s/>)
And the text is:
If you only need to capture the text between the quotes:
<meta content="(.*)" name="description" \\/>
Emojis are just another Unicode code point, so there is no need to handle them differently…
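To illustrate the point that an emoji is just one more code point to the regex engine, here is a minimal sketch in Python, which is not the engine used in this thread. The sample line and the helper name extract_content are made up for illustration, and the capture group inside the lookbehind of the original pattern is dropped for simplicity.

```python
import re
from typing import Optional

# Same idea as the pattern in the question: everything between
# <meta content=" and the closing " />.
PATTERN = re.compile(r'(?<=<meta\scontent=").*(?="\s/>)')


def extract_content(line: str) -> Optional[str]:
    """Return the text of the content attribute, or None if there is no match."""
    match = PATTERN.search(line)
    return match.group(0) if match else None


# Hypothetical sample with two kiss emojis (U+1F618) in the middle.
sample = '<meta content="Sunset \U0001F618\U0001F618 later on" />'

if __name__ == "__main__":
    print(extract_content(sample))
```

In this sketch the match runs all the way to the final "on", emojis included, which is in line with the claim that the match itself should not need special handling.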
But then why are they (apparently) throwing off the results?
The first two lines are captured perfectly, but the next one, the one with two kiss emojis, does not reach the same end. It’s missing the “on” at the end.
Also, the forum could not handle pasting them into a post, so something odd is definitely going on.
As far as I remember, regex is not fully Unicode-aware; I’ve had problems with this, too. All Unicode letters are matched with “\pL”. See for instance https://lexikos.github.io/v1/docs/misc/RegEx-QuickRef.htm .
You can always try RegExMBS class as an alternative.
Feel free to message me a link to download a file with your test text and I’ll take a peek (or email me directly at email@example.com).
Looks like you found a bug in RegExRX. If you look at the Match List, the proper matches are being returned, but visually RegExRX is off in its display.
Ha, you’re right, it seems. I guess the incorrect highlights drew so much of my attention, I didn’t get past them to looking at the match results themselves, thanks!
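For anyone wondering why the highlight would stop exactly two characters short: one plausible explanation, not confirmed anywhere in this thread, is that the match offsets are counted in one unit and the display in another. Each kiss emoji is a single code point but takes two UTF-16 code units, so two of them would shift an offset by two, the length of “on”. A minimal sketch in Python, with a made-up sample line and hypothetical helper names:

```python
def utf16_units(text: str) -> int:
    """Number of UTF-16 code units needed to encode the text."""
    return len(text.encode("utf-16-le")) // 2


def utf16_offset(text: str, codepoint_index: int) -> int:
    """Convert a code point offset into a UTF-16 code unit offset."""
    return utf16_units(text[:codepoint_index])


kiss = "\U0001F618"  # one code point, encoded as a surrogate pair in UTF-16
line = f"Sunset {kiss}{kiss} later on"  # hypothetical sample text

if __name__ == "__main__":
    end_in_codepoints = len(line)
    end_in_utf16 = utf16_offset(line, end_in_codepoints)
    # The difference is how far a highlight would drift if the two
    # kinds of offset were mixed up.
    print(end_in_codepoints, end_in_utf16, end_in_utf16 - end_in_codepoints)
```

If the two kinds of offset are mixed up, the drift grows by one for every such emoji that precedes the end of the match, which would fit the first two lines looking fine and the line with two kisses coming up two characters short.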