RegEx Pattern wanted: SmallCaps

  1. last year

    Martin T

    3 Apr 2019 Testers Germany
    Edited last year

    Hi,

    I'm looking for a pattern that matches everything from an uppercase character up to, but not including, the next one. If possible, it should also work for letters of other writing systems (Cyrillic, Asian, etc.).

    (\p{Lu}\p{Ll}*)

    This pattern only finds single Unicode words that begin with a capital letter. It does not include anything that follows the word (digits, spaces, punctuation) up to the next capital letter.

    -image-

  2. Kem T

    3 Apr 2019 Testers, Xojo Pro, XDC Speakers, MVP Connecticut

    Try this:

    (?U)\p{Lu}\X*(?=\p{Lu}|\z)
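
For readers who want to try the idea outside Xojo, here is a minimal Python sketch of the same "lazy match plus lookahead" technique. It is only an approximation: Python's built-in `re` module does not support `\p{Lu}` or `\X`, so the sketch assumes ASCII input and uses `[A-Z]` and `.` as stand-ins. The name `split_ascii` is made up for illustration.

```python
import re

# ASCII-only stand-in for (?U)\p{Lu}\X*(?=\p{Lu}|\z):
#   [A-Z]  replaces \p{Lu}
#   .*?    (lazy, with re.S so "." also matches newlines) replaces the ungreedy \X*
#   \Z     is Python's end-of-string anchor, like PCRE's \z
PATTERN = re.compile(r"[A-Z].*?(?=[A-Z]|\Z)", re.S)


def split_ascii(text):
    """Return chunks that start at a capital letter and stop before the next one."""
    return PATTERN.findall(text)


print(split_ascii("HelloWorld 123Foo"))  # ['Hello', 'World 123', 'Foo']
print(split_ascii("     Hello World"))   # ['Hello ', 'World']
```

The second call shows that anything before the first capital letter, such as leading whitespace, is not part of any match.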
  3. Martin T

    4 Apr 2019 Testers Germany

    Excellent Kem, thank you very much. I made a small modification so that whitespace at the beginning of the string is also recognized.

    (?U)\X*\p{Lu}\X*(?=\p{Lu}|\z)
  4. Martin T

    4 Apr 2019 Testers Germany

    I was wrong about my change. The pattern does not work if a string starts with any number of whitespace characters. That whitespace would then also have to be included in the split, up to the next capital letter.

  5. Kem T

    4 Apr 2019 Testers, Xojo Pro, XDC Speakers, MVP Connecticut

    Do you mean this?

    (?U)(^\s+)?\p{Lu}\X*(?=\p{Lu}|\z)
  6. Martin T

    4 Apr 2019 Testers Germany

    Not exactly. If a string begins with whitespace, that whitespace should be part of the first match, e.g.

    “     Hello World” > “     Hello ” & “World”
  7. Kem T

    4 Apr 2019 Testers, Xojo Pro, XDC Speakers, MVP Connecticut
    (?U)\s*\p{Lu}\X*(?=\s*(\p{Lu}|\z))
  8. Martin T

    4 Apr 2019 Testers Germany

    Thank you Kem, the pattern is getting closer to the desired result. But it's not perfect yet. My hint regarding spaces should only apply to spaces before the very first letter of a string. Your pattern currently returns the following:

    “     Hello” & “ World”

    However, I would like to see

    “     Hello ” & “World”
  9. Robert L

    4 Apr 2019 XDC Speakers Federal Way, WA (Seattle Area)

    Is this of any use?

    (^\s*\p{Lu}|\p{Lu})([^\p{Lu}\n])*

  10. Martin T

    4 Apr 2019 Testers Germany

    Thanks Robert, after a small modification, this looks like what I was searching for.

    ((?:^\s*\p{Lu}|\p{Lu})(?:[^\p{Lu}\n])*)
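
As a cross-check of what this pattern is meant to return, here is a rough Python sketch that does the same split by hand. It is not a translation of the regex: Python's built-in `re` has no `\p{Lu}`, so the sketch swaps in `unicodedata.category` to test for uppercase letters and walks the string character by character. It assumes single-line input (the pattern stops a chunk at `\n`; the sketch does not model that), and the function name `split_at_uppercase` is hypothetical.

```python
import unicodedata


def split_at_uppercase(text):
    """Split text into chunks that each start at an uppercase letter (Unicode Lu).

    Leading whitespace before the very first uppercase letter stays with the
    first chunk. Any other text before the first uppercase letter is dropped,
    which is also what the pattern does.
    """
    chunks = []
    current = ""
    started = False  # becomes True once the first uppercase letter is seen
    for ch in text:
        if unicodedata.category(ch) == "Lu":
            if started:
                chunks.append(current)
                current = ch
            else:
                # Keep the prefix only if it is empty or pure whitespace.
                prefix = current if current.strip() == "" else ""
                current = prefix + ch
                started = True
        else:
            current += ch
    if started:
        chunks.append(current)
    return chunks


print(split_at_uppercase("     Hello World"))  # ['     Hello ', 'World']
print(split_at_uppercase("Привет Мир"))        # ['Привет ', 'Мир']
```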
  11. 7 months ago

    Martin T

    5 Nov 2019 Testers Germany

    I have to come back to this subject. It's not quite what I need yet.

    I need to divide a string into items. The point is to split the string by letter case, treating punctuation and special characters like any other non-lowercase character. To find runs of lowercase Unicode letters in a string, I now use the following pattern:

    (?:\p{Ll}+)

    But what I need is an array of pairs covering all parts of the string, where the second value of each pair, a Boolean, indicates whether that part is a RegEx match or not. For my example string HHmmmHelloWorld123&?Ббрусский+*0815, the returned array should look like this:

    HH       : False
    mmm      : True
    H        : False
    ello     : True
    W        : False
    orld     : True
    123&?Б   : False
    брусский : True
    +*0815   : False

    How do I get exactly this array, since the pattern currently only finds the values marked True?

  12. Kem T

    5 Nov 2019 Testers, Xojo Pro, XDC Speakers, MVP Answer Connecticut

    Try this:

    (\p{Ll}+)|\P{Ll}+

    If the match has a subgroup (SubExpressionCount = 2), then it's all lowercase. Otherwise, if SubExpressionCount = 1, it's not.
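
To make the expected result concrete, here is a small Python sketch that produces the same pairs. It does not use a regex at all: since Python's built-in `re` lacks `\p{Ll}`, it swaps in `unicodedata.category` together with `itertools.groupby` to cut the string into maximal runs of lowercase letters and maximal runs of everything else, which is the same partition the alternation above yields. The name `tag_lowercase_runs` is made up for illustration.

```python
from itertools import groupby
import unicodedata


def is_lowercase_letter(ch):
    """True if ch is in Unicode category Ll, i.e. what \\p{Ll} matches."""
    return unicodedata.category(ch) == "Ll"


def tag_lowercase_runs(text):
    """Return (substring, is_lowercase) pairs that cover the whole string."""
    return [
        ("".join(run), is_lower)
        for is_lower, run in groupby(text, key=is_lowercase_letter)
    ]


for part, is_lower in tag_lowercase_runs("HHmmmHelloWorld123&?Ббрусский+*0815"):
    print(f"{part} : {is_lower}")
```

Each `True` pair corresponds to a match where the capture group took part, and each `False` pair to a match of the `\P{Ll}+` branch.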

  13. Martin T

    5 Nov 2019 Testers Germany

    Kem, you're our RegEx God, thank you so much, it works like candy.
