RegEx Pattern wanted: SmallCaps

Hi,

I’m looking for a pattern that matches everything from an uppercase character up to, but not including, the next one. If possible, it should also work for letters of other writing systems (Cyrillic, Asian, etc.).

(\p{Lu}\p{Ll}*)

This pattern only finds single Unicode words that begin with a capital letter; it misses anything after the lowercase letters (spaces, digits, punctuation) up to the next capital letter.

Try this:

(?U)\p{Lu}\X*(?=\p{Lu}|\z)
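For anyone who wants to try the idea outside the RegEx engine used here, this is a minimal Python sketch of the same technique: a lazy match that stops at a lookahead for the next capital letter or the end of the string. It assumes the `(?U)` prefix is the PCRE-style "ungreedy" switch, which Python's `re` module does not have, so the quantifier is written as `.*?` instead. Python's `re` also lacks `\p{Lu}` and `\X`, so `[A-Z]` and `.` are ASCII-only stand-ins. The names `CHUNK` and `chunks` are made up for this example.

```python
import re

# ASCII-only stand-in for (?U)\p{Lu}\X*(?=\p{Lu}|\z):
#   [A-Z]            replaces \p{Lu} (uppercase letter)
#   .*? with DOTALL  replaces the ungreedy \X*
#   \Z               is Python's end-of-string anchor (PCRE's \z)
CHUNK = re.compile(r"[A-Z].*?(?=[A-Z]|\Z)", re.DOTALL)


def chunks(text):
    """Return every run that starts at a capital letter and ends before the next one."""
    return CHUNK.findall(text)


print(chunks("Hello World Foo"))
```

Note that text before the first capital letter is not part of any match, which is the same limitation discussed in the following posts.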

Excellent Kem, thank you very much. I made a small modification so that whitespace at the beginning of the string is also recognized.

(?U)\X*\p{Lu}\X*(?=\p{Lu}|\z)

I was wrong about my change. The pattern does not work if a string starts with x…* whitespace. Such strings would then also have to be split up to the next capital letter.

Do you mean this?

(?U)(^\s+)?\p{Lu}\X*(?=\p{Lu}|\z)

Not exactly. If a string begins with whitespace, it should be part of the first match, e.g.

"     Hello World" > "     Hello " & "World"

(?U)\s*\p{Lu}\X*(?=\s*(\p{Lu}|\z))

Thank you Kem, the pattern is getting closer to the desired result. But it’s not perfect yet. My hint regarding spaces should only apply to spaces before the very first letter of a string. Your pattern currently returns the following:

[code]" Hello" & " World"[/code]

However, I would like to see

[code]" Hello " & "World"[/code]
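The behaviour asked for here can also be sketched without RegEx, which makes the rule explicit: every chunk starts at an uppercase letter, and only the first chunk may additionally carry leading whitespace. This is a minimal Python sketch, not the pattern used in this thread. The function name `split_at_uppercase` is made up, Unicode category `Lu` is taken as the equivalent of `\p{Lu}`, and `str.strip()` is assumed to be close enough to `\s` for the whitespace check.

```python
import unicodedata


def split_at_uppercase(text):
    """Split text into chunks that each start at an uppercase letter (category Lu)."""
    # Positions of every uppercase letter, in any script that has letter case.
    starts = [i for i, ch in enumerate(text) if unicodedata.category(ch) == "Lu"]
    if not starts:
        return []  # nothing to split on, like a RegEx search without a match

    # Assumption: only pure whitespace before the first uppercase letter is
    # pulled into the first chunk; any other leading text is skipped.
    if text[:starts[0]].strip() == "":
        starts[0] = 0

    # Each chunk ends where the next one starts; the last runs to the end.
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends)]


print(split_at_uppercase(" Hello World"))
```

Scripts without letter case have no `Lu` characters, so a string written only in such a script yields no chunks at all in this sketch.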

Is this of any use?

(^\s*\p{Lu}|\p{Lu})([^\p{Lu}])*

Thanks Robert, after a small modification, this looks like what I was searching for.


I shall return to this subject. It’s not quite what I need yet.

I need to divide a string into items. The point is to split the string by letter case, ignoring punctuation and special characters. To find lowercase Unicode text in a string, I now use the following pattern:

(?:\p{Ll}+)

But what I need is an array of pairs covering all parts of the string, where the second value of each pair, a boolean, indicates whether that part is a RegEx match or not. So the returned array for my example string HHmmmHelloWorld123&???+*0815 should look like this:

HH : False
mmm : True
H : False
ello : True
W : False
orld : True
123&?? : False
???????? : True
+*0815 : False

How do I get exactly this array, since the pattern currently only finds the values marked True?

Try this:

(\p{Ll}+)|\P{Ll}+

If the match has a subgroup (SubExpressionCount = 2), then it’s all lowercase. Otherwise, if SubExpressionCount = 1, it’s not.
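The same pairing of text runs with a True/False flag can be sketched in Python for comparison. This is not the code used in this thread. It swaps the pattern for `itertools.groupby`, which groups consecutive characters by whether their Unicode category is `Ll` (assumed here to be the equivalent of `\p{Ll}`). The names `is_lower` and `lowercase_runs` are made up for the example.

```python
import unicodedata
from itertools import groupby


def is_lower(ch):
    """True if ch is a lowercase letter (Unicode category Ll)."""
    return unicodedata.category(ch) == "Ll"


def lowercase_runs(text):
    """Split text into consecutive runs, each paired with a lowercase flag."""
    # groupby starts a new group whenever is_lower() changes its answer,
    # so the runs alternate between lowercase and everything else.
    return [("".join(group), flag) for flag, group in groupby(text, key=is_lower)]


for run, flag in lowercase_runs("HHmmmHelloWorld123"):
    print(run, ":", flag)
```

As with the pattern above, every character of the input ends up in exactly one run, so joining the runs gives back the original string.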

Kem, you’re our RegEx God, thank you so much, it works like candy.