Regex and emojis...

Dave_Kondris · January 26, 2018, 10:06am

I’m trying to parse Instagram pages, and of course many users sprinkle their posts with emojis. I’m not sure how to deal with them, as some seem to act as double characters.

Here’s a simplified example:

The regex is (?<=(<meta\scontent=")).*(?="\s/>)

And the text is:

[code]

shao_sean · January 26, 2018, 10:15am

If you only need to capture the text between the quotes, use <meta content="(.*)" name="description" \/> and read the first capture group. An emoji is just another Unicode code point, so there is no need to handle them differently…
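
For illustration only, here is a minimal sketch of that capture-group approach in Python rather than Xojo. The sample line is hypothetical (the original test text did not survive in this thread), the emoji are written as escapes for U+1F618, and the closing tag is matched as a plain " />".

```python
import re

# Hypothetical sample line; the thread's real test text was not preserved.
# \U0001F618 is the "face blowing a kiss" emoji, written as an escape.
html = '<meta content="Great day \U0001F618\U0001F618 on" name="description" />'

# Capture everything between the quotes of the content attribute.
pattern = re.compile(r'<meta content="(.*)" name="description" />')

match = pattern.search(html)
captured = match.group(1) if match else None
print(captured)
```

In this sketch the emoji need no special treatment: the capture group runs up to the closing quote, including the trailing “on”.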

Dave_Kondris · January 26, 2018, 10:20am

But then why are they (apparently) throwing off the results?

The first two lines are captured perfectly, but the next one, which contains two kisses, does not reach the same end: it’s missing the “on” at the end.

Also, the forum could not handle pasting them into a post, so something odd is definitely going on.

Beatrix_Willius · January 26, 2018, 11:04am

As far as I remember, the regex engine is not fully Unicode-aware; I’ve had problems with this, too. All letters are matched with “\pL”. See, for instance, https://lexikos.github.io/v1/docs/misc/RegEx-QuickRef.htm .

Christian_Schmitz · January 26, 2018, 11:05am

You can always try the RegExMBS class as an alternative.

shao_sean · January 26, 2018, 11:51am

Feel free to message me a link to download a file with your test text and I’ll take a peek (or email me directly at shaosean@hotmail.com).

Kem_Tekinay · January 26, 2018, 3:37pm

Looks like you found a bug in RegExRX. If you look at the Match List, the proper matches are being returned, but visually RegExRX is off in its display.
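
One possible explanation for this kind of display drift (an assumption on my part, not something confirmed for RegExRX) is a mismatch between offsets counted in code points and offsets counted in UTF-16 code units, since an emoji outside the Basic Multilingual Plane takes two UTF-16 units. A small Python sketch with hypothetical sample text and a hypothetical helper, utf16_offset:

```python
# Hypothetical sample text: two kiss emoji (U+1F618) followed by "on".
text = 'xo \U0001F618\U0001F618 on'

# Length counted in Unicode code points (what Python's len() reports).
code_points = len(text)

# Length counted in UTF-16 code units (2 bytes per unit).
utf16_units = len(text.encode('utf-16-le')) // 2


def utf16_offset(s: str, cp_index: int) -> int:
    """Hypothetical helper: convert a code-point index to a UTF-16 unit index."""
    return len(s[:cp_index].encode('utf-16-le')) // 2


start = text.index('on')                 # offset in code points
start_utf16 = utf16_offset(text, start)  # offset in UTF-16 units

print(code_points, utf16_units)  # 8 10
print(start, start_utf16)        # 6 8
```

With two emoji before the match, the two ways of counting differ by two positions, which would be consistent with a highlight that stops two characters short while the match list itself is correct.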

Dave_Kondris · January 26, 2018, 10:28pm

Ha, you’re right, it seems. The incorrect highlights drew so much of my attention that I never got past them to look at the match results themselves. Thanks!