RegEx Pattern wanted: SmallCaps

  1. Martin T

    3 Apr 2019 Pre-Release Testers Germany
    Edited 7 months ago

    Hi,

    I'm looking for a pattern that matches everything from an uppercase character up to, but not including, the next one. If possible, it should also work for letters of other writing systems (Cyrillic, Asian, etc.).

    (\p{Lu}\p{Ll}*)

    This pattern only matches a capital letter followed by lowercase letters. It stops at the first character that is not a lowercase letter instead of continuing to the next capital letter.


  2. Kem T

    3 Apr 2019 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut

    Try this:

    (?U)\p{Lu}\X*(?=\p{Lu}|\z)
  3. Martin T

    4 Apr 2019 Pre-Release Testers Germany

    Excellent Kem, thank you very much. I made a small modification so that whitespace at the beginning of the string is also recognized.

    (?U)\X*\p{Lu}\X*(?=\p{Lu}|\z)
  4. Martin T

    4 Apr 2019 Pre-Release Testers Germany

    I was wrong about my change. The pattern does not work when a string starts with any amount of whitespace. That leading whitespace would also have to be included in the split, up to the next capital letter.

  5. Kem T

    4 Apr 2019 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut

    Do you mean this?

    (?U)(^\s+)?\p{Lu}\X*(?=\p{Lu}|\z)
  6. Martin T

    4 Apr 2019 Pre-Release Testers Germany

    Not exactly. If a string begins with whitespace, that whitespace should be part of the first match, e.g.

    “     Hello World“ > „     Hello “ & „World“
  7. Kem T

    4 Apr 2019 Pre-Release Testers, Xojo Pro, XDC Speakers Connecticut
    (?U)\s*\p{Lu}\X*(?=\s*(\p{Lu}|\z))
  8. Martin T

    4 Apr 2019 Pre-Release Testers Germany

    Thank you Kem, the pattern is getting closer to the desired result. But it's not perfect yet. My hint regarding spaces should only apply to spaces before the very first letter of a string. Your pattern currently returns the following:

    “     Hello“ & “ World“
    
    However, I would like to see
    
    “     Hello “ & “World“
  9. Robert L

    4 Apr 2019 Federal Way, WA (Seattle Area)

    Is this of any use?

    (^\s*\p{Lu}|\p{Lu})([^\p{Lu}\n])*

  10. Martin T

    4 Apr 2019 Pre-Release Testers Germany

    Thanks Robert, after a small modification, this looks like what I was searching for.

    ((?:^\s*\p{Lu}|\p{Lu})(?:[^\p{Lu}\n])*)
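
    The splitting rule discussed above (start a new piece at every capital letter, but keep leading whitespace attached to the first piece) can be sketched outside the regex engine. This is only an illustration in Python: the built-in re module has no \p{Lu}, so the sketch swaps in unicodedata.category for the character test. The names is_upper and split_at_capitals are hypothetical. Unlike the pattern above, it keeps any leading characters (not only whitespace) in the first piece and does not treat newlines specially.

```python
import unicodedata


def is_upper(ch):
    # Same test as \p{Lu}: Unicode general category "Lu" (uppercase letter).
    return unicodedata.category(ch) == "Lu"


def split_at_capitals(text):
    # Hypothetical helper: cut the text in front of every capital letter
    # except the first one, so leading whitespace stays in the first piece.
    chunks = []
    current = ""
    seen_capital = False
    for ch in text:
        if is_upper(ch):
            if seen_capital:
                chunks.append(current)
                current = ""
            seen_capital = True
        current += ch
    if current:
        chunks.append(current)
    return chunks


print(split_at_capitals("     Hello World"))
```

    For the example from earlier in the thread, the two pieces are the leading spaces plus "Hello " and then "World".
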
  11. Martin T

    Nov 5 Pre-Release Testers Germany

    I have to come back to this topic. It's not quite what I need yet.

    I need to divide a string into items. The point is to split the string by letter case, ignoring punctuation and special characters. To find runs of lowercase Unicode letters in a string, I now use the following pattern:

    (?:\p{Ll}+)

    But what I need is an array of pairs covering every part of the string, where the second value of each pair, a Boolean, indicates whether that part is a RegEx match. For my example string HHmmmHelloWorld123&?Ббрусский+*0815, the returned array should look like this:

    HH       : False
    mmm      : True
    H        : False
    ello     : True
    W        : False
    orld     : True
    123&?Б   : False
    брусский : True
    +*0815   : False

    How do I get exactly this array, since the pattern currently only finds the values marked True?

  12. Kem T

    Nov 5 Pre-Release Testers, Xojo Pro, XDC Speakers Answer Connecticut

    Try this:

    (\p{Ll}+)|\P{Ll}+

    If the match has a subgroup (SubExpressionCount = 2), then it's all lowercase. Otherwise, if SubExpressionCount = 1, it's not.
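
    The idea behind this pattern is that every character is either a lowercase letter or not, so alternating between the two branches consumes the whole string, and the capture group tells you which branch matched. A minimal sketch of the same result in Python, as an illustration only: Python's built-in re module does not support \p{Ll}, so this swaps in unicodedata.category and itertools.groupby instead of the regex. The names is_lower and split_by_lowercase are hypothetical.

```python
import unicodedata
from itertools import groupby


def is_lower(ch):
    # Same test as \p{Ll}: Unicode general category "Ll" (lowercase letter).
    return unicodedata.category(ch) == "Ll"


def split_by_lowercase(text):
    # Group consecutive characters that share the same is_lower() result.
    # Each pair is (part, True if the part is all lowercase letters).
    return [("".join(run), flag) for flag, run in groupby(text, key=is_lower)]


pairs = split_by_lowercase("HHmmmHelloWorld123&?Ббрусский+*0815")
for part, matched in pairs:
    print(f"{part!r:12} : {matched}")
```

    The Boolean in each pair plays the role of the SubExpressionCount check: True corresponds to a match with the subgroup, False to a match without it.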

  13. Martin T

    Nov 5 Pre-Release Testers Germany

    Kem, you're our RegEx God, thank you so much, it works like candy.
