RegEx for Unicode uppercase letters

Jonathan_Ashwell · July 12, 2013, 8:43pm

I need to tell whether a letter is uppercase or not, in all languages. I thought I’d write a regular expression for this, but when I use a \p construct, an error is thrown saying that the PCRE library was compiled without Unicode support (a limitation that is mentioned on this PCRE page: http://www.regular-expressions.info/pcre.html).

So two questions: 1) Is there any reason why it was compiled without Unicode support, and if not, should I file a feature request? 2) Is there a workaround with the current implementation of RegEx in Xojo?

John_Hansen · July 13, 2013, 4:58pm

I don’t know if you have read this page: http://www.regular-expressions.info/realbasic.html

Text from link above:
“REALbasic uses the UTF-8 version of PCRE. This means that if you want to process non-ASCII data that you’ve retrieved from a file or the network, you’ll need to use REALbasic’s TextConverter class to convert your strings into UTF-8 before passing them to the RegEx object. You’ll also need to use the TextConverter to convert the strings returned by the RegEx class from UTF-8 back into the encoding your application is working with.”

Jonathan_Ashwell · July 13, 2013, 5:15pm

Thanks, I have, but that really isn’t relevant here. The characters I want to evaluate are already in UTF-8. With regular expressions it’s possible to determine whether a character is uppercase, but the PCRE build used by Xojo doesn’t allow those constructs (I linked to the documentation and described the error in my original post). The questions still stand.

John_Hansen · July 13, 2013, 5:41pm

I have read your link and compared it with the Xojo documentation. So far I haven’t found any documentation for \p in Xojo (http://documentation.xojo.com/index.php/RegEx). Also, Xojo’s RegEx supports UTF-8 by default. And the PCRE page says: PCRE implements almost the entire Perl 5.8 regular expression syntax. Only the support for various Unicode properties with \p is incomplete.

Unless Xojo’s documentation is missing the “\p” options, it could be that they are automatically included when using the Xojo RegEx class. What result do you get when you’re not using \p?

Jonathan_Ashwell · July 13, 2013, 5:49pm

It’s \p. If you use it in Xojo RegEx, you get the error that PCRE was compiled without Unicode support. Do you understand what I’m trying to achieve? I want to use RegEx to tell whether a character I pass it is uppercase or not, and it should work with all Unicode characters, not just the ASCII subset. Do you know how to do this without using the \p RegEx constructs?
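
For reference, the property being asked about is what \p{Lu} matches in engines that support it: a character whose Unicode general category is Lu (uppercase letter). Here is a minimal sketch of that test in Python, not Xojo, using only the standard `unicodedata` module; the function name `is_uppercase_letter` is made up for illustration and nothing here reflects how Xojo’s RegEx class works.

```python
import unicodedata


def is_uppercase_letter(ch: str) -> bool:
    """Return True if ch is one code point with general category Lu.

    This mirrors what \\p{Lu} tests in a Unicode-aware regex engine.
    The function name is hypothetical, chosen for this example only.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return unicodedata.category(ch) == "Lu"
```

Caseless scripts fall out naturally here: a CJK ideograph has category Lo, so it is reported as not uppercase rather than being misclassified.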

John_Hansen · July 13, 2013, 6:35pm

Sorry I misunderstood your question.

There might be a workaround: apply “Uppercase” and then compare the string before and after the conversion.

Jonathan_Ashwell · July 13, 2013, 6:40pm

I don’t think that would work – Asian characters won’t change with uppercase(), which if I understand what you’re getting at would mean I should consider them to be in uppercase and of course they’re not. It would be great if a Xojo engineer would comment on this. I’m going to go ahead and file a bug/feature report.
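
One way around the caseless-character objection is to compare against both conversions: treat a character as uppercase only if uppercasing leaves it unchanged and lowercasing changes it. The sketch below shows that logic in Python rather than Xojo, purely as an illustration; `is_uppercase_by_mapping` is a hypothetical name. In Xojo the comparisons would presumably have to be case-sensitive (for example a binary comparison), since the default string comparison there ignores case; that part is an assumption and is not shown here.

```python
def is_uppercase_by_mapping(ch: str) -> bool:
    """Heuristic uppercase test based only on case conversion.

    A character counts as uppercase when lowercasing changes it
    (so it has case at all) and uppercasing leaves it unchanged.
    Caseless characters, such as CJK ideographs, fail the first
    condition. The function name is hypothetical.
    """
    return ch != ch.lower() and ch == ch.upper()
```

This is a heuristic based on case mappings, so it may not agree with the Lu general category for every character.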

John_Hansen · July 13, 2013, 7:33pm

Xojo’s PCRE library is also very old: version 7.7, about 5 years old (according to the documentation).

Latest version is 8.33 (http://vcs.pcre.org/viewvc/code/tags/)

Rick_Araujo · July 13, 2013, 10:35pm

Wow. Time for an update! (I’d say it was due around 3 years ago.)

PCRE is just “another thing” by now: Unicode has been updated a few times and many UTF patches have been made.

http://vcs.pcre.org/viewvc/code/tags/pcre-8.33/ChangeLog?revision=1336&view=markup&pathrev=1336

Kem_Tekinay · July 13, 2013, 11:53pm

RegExMBS has that support. At the moment, that’s your only option, but the other benefit is that it’s faster and has more options.

Jonathan_Ashwell · July 14, 2013, 1:58pm

Feature request filed

<https://xojo.com/issue/28136>