RegEx Pattern wanted: SmallCaps

Martin_T · April 3, 2019, 11:49pm

Hi,

I’m looking for a pattern that matches everything from an uppercase character up to, but not including, the next one. If possible, it should also work for letters of other writing systems (Cyrillic, Asian, etc.).

(\p{Lu}\p{Ll}*)

This pattern only finds single Unicode words that begin with a capital letter, but not the text that follows up to the next capital letter.

Kem_Tekinay · April 4, 2019, 2:57am

Try this:

(?U)\p{Lu}\X*(?=\p{Lu}|\z)
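
For anyone who wants to try the idea outside a PCRE engine, here is a minimal Python sketch of what the pattern is meant to do: start a segment at each uppercase letter and run until just before the next one. It assumes Python's built-in `re` module, which as far as I know does not accept `\p{Lu}` or `\X`, so the sketch checks the Unicode category with `unicodedata` instead of using a regex. The function names are made up for illustration.

```python
import unicodedata


def is_upper(ch: str) -> bool:
    # "Lu" is the Unicode general category for uppercase letters,
    # the same category that \p{Lu} refers to.
    return unicodedata.category(ch) == "Lu"


def split_at_uppercase(text: str) -> list[str]:
    """Return the segments that start at an uppercase letter and end
    just before the next uppercase letter (or at the end of the text)."""
    starts = [i for i, ch in enumerate(text) if is_upper(ch)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


print(split_at_uppercase("HelloWorld"))     # ['Hello', 'World']
print(split_at_uppercase("  Hello World"))  # ['Hello ', 'World']
print(split_at_uppercase("ПриветМир"))      # ['Привет', 'Мир']
```

Like the pattern, this drops anything that comes before the first uppercase letter, which is why the leading spaces in the second call do not show up in the result.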

Martin_T · April 4, 2019, 10:05am

Excellent Kem, thank you very much. I made a small modification so that whitespace at the beginning of the string is also recognized.

(?U)\X*\p{Lu}\X*(?=\p{Lu}|\z)

Martin_T · April 4, 2019, 10:25am

I was wrong about my change. The pattern does not work if a string starts with x…* whitespace. These would then also have to be split until the next capital letter comes.

Kem_Tekinay · April 4, 2019, 1:32pm

Do you mean this?

(?U)(^\s+)?\p{Lu}\X*(?=\p{Lu}|\z)

Martin_T · April 4, 2019, 1:40pm

Not exactly. If a string begins with whitespace, it should be part of the first match, e.g.

     Hello World >      Hello  & World

Kem_Tekinay · April 4, 2019, 1:50pm

(?U)\s*\p{Lu}\X*(?=\s*(\p{Lu}|\z))

Martin_T · April 4, 2019, 10:53pm

Thank you Kem, the pattern is getting closer to the desired result. But it’s not perfect yet. My hint regarding spaces should only apply to spaces before the very first letter of a string. Your pattern currently returns the following:

 Hello & World

However, I would like to see

Hello & World
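
The behaviour asked for here (leading whitespace goes into the first item, later whitespace stays attached to the item before it) can be sketched in Python without a regex. This is only an illustration under my reading of the example given earlier in the thread; `split_keep_leading` is a hypothetical name, and `unicodedata` stands in for `\p{Lu}`.

```python
import unicodedata


def split_keep_leading(text: str) -> list[str]:
    """Cut the text in front of every uppercase letter except the first,
    so anything before the first uppercase letter stays in the first item."""
    if not text:
        return []
    upper = [i for i, ch in enumerate(text)
             if unicodedata.category(ch) == "Lu"]
    # The first cut is always position 0; the first uppercase letter is skipped.
    cuts = [0] + upper[1:] + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


print(split_keep_leading("     Hello World"))  # ['     Hello ', 'World']
```

Note that a string without any uppercase letter comes back as a single item, which may or may not be what is wanted.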

Robert_Livingston · April 5, 2019, 1:14am

Is this of any use?

(^\s*\p{Lu}|\p{Lu})([^\p{Lu}
])*

Martin_T · April 5, 2019, 1:34am

Thanks Robert, after a small modification, this looks like what I was searching for.


Martin_T · November 5, 2019, 8:36pm

I shall return to this subject. It’s not quite what I need yet.

I need to divide a string into items. The point is to divide the string by letter case, ignoring punctuation and special characters. To find lowercase Unicode text in a string, I now use the following pattern:

(?:\p{Ll}+)

But what I need is an array of pairs with all parts, where the right value of a pair, a boolean, should indicate whether it is a RegEx match or not. So the returned array of my example string HHmmmHelloWorld123&???+*0815 should look like this:

HH : False
mmm : True
H : False
ello : True
W : False
orld : True
123&?? : False
???????? : True
+*0815 : False

How do I get exactly this array, since the pattern currently only finds the values marked True?

Kem_Tekinay · November 5, 2019, 9:43pm

Try this:

(\p{Ll}+)|\P{Ll}+

If the match has a subgroup (SubExpressionCount = 2), then it’s all lowercase. Otherwise, if SubExpressionCount = 1, it’s not.
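
The trick is that only the first branch of the alternation captures, so whether the group took part in the match tells you which branch matched. A rough Python sketch of the same idea follows. It makes two assumptions: `[a-z]` is used as an ASCII-only stand-in because the built-in `re` module, as far as I know, does not accept `\p{Ll}` (the third-party `regex` package does), and the input is a shortened ASCII version of the example string. Python has no `SubExpressionCount`, so the check is `group(1) is not None` instead; `tag_runs` is a made-up name.

```python
import re

# ASCII stand-in for (\p{Ll}+)|\P{Ll}+ : only the first branch captures.
PATTERN = re.compile(r"([a-z]+)|[^a-z]+")


def tag_runs(text: str) -> list[tuple[str, bool]]:
    """Pair each run with True if the capturing (lowercase) branch matched."""
    return [(m.group(0), m.group(1) is not None)
            for m in PATTERN.finditer(text)]


for run, is_lower in tag_runs("HHmmmHelloWorld123&+*0815"):
    print(f"{run} : {is_lower}")
```

Because the two branches together cover every character, the runs come back in order and add up to the whole input string.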

Martin_T · November 5, 2019, 10:43pm

Kem, you’re our RegEx God, thank you so much, it works like candy.