Regex to identify consecutive initials in name and add period separators

Jonathan_Ashwell · March 27, 2024, 10:36pm

I need to detect first “names” that contain or consist of capitalized initials and separate the initials with periods. For example

RR → R.R.
R R → leave alone
RRe → leave as is
Edward RR → Edward R.R.
ÉÉ → É.É.

As I see it, the rules would be that the search would find one or more contiguous letters surrounded by word boundaries.

It doesn’t sound hard, but when I try my (amateurish) solutions in RegExRX they all fail.

Any suggestions?

brian_franco · March 27, 2024, 10:45pm

Hopefully @Kem_Tekinay reads this as he is fluent in Klingon

TimStreater · March 27, 2024, 11:00pm

Should be RR → R. R. IMO.

Jonathan_Ashwell · March 27, 2024, 11:09pm

Once there’s a regex pattern you can modify it any way you want.

Julia_Truchsess · March 27, 2024, 11:45pm

I pasted your query nearly verbatim into Claude AI and this is its response (take with a grain of salt and test it; I did not test):

"Here’s a PCRE regex that should match and transform first “names” containing or consisting of capitalized initials, separating the initials with periods:

/\b([A-ZÉ])(?=\1*\b)(?!\1*[a-zé])/g

Explanation:

\b: Matches a word boundary, ensuring the pattern is not part of a larger word
([A-ZÉ]): Matches and captures a single uppercase letter (including É)
(?=\1*\b): Positive lookahead to ensure the captured letter is followed by zero or more occurrences of the same letter until a word boundary
(?!\1*[a-zé]): Negative lookahead to ensure the captured letter is not followed by any lowercase letters (including é)
/g: Global flag to match all occurrences in the string

To use this regex for the replacement, you can use a callback function that replaces the matched initials with the captured group followed by a period:

function replacer(match, p1) {
    return p1 + ".";
}

str.replace(/\b([A-ZÉ])(?=\1*\b)(?!\1*[a-zé])/g, replacer);

Here’s how the examples you provided would be transformed:

RR → R.R.
R R → R R (left alone)
RRe → RRe (left as is)
Edward RR → Edward R.R.
ÉÉ → É.É.
"

Paul_Chance · March 28, 2024, 2:04am

GPT-4 says:

Creating a regex to detect names with capitalized initials and manipulate them by inserting periods between the initials can be a bit tricky due to the inherent variability of natural language text. Here’s a regex that should work for the examples you provided:

\b([A-Z])(?=[A-Z\b])\b

This regex will look for a word boundary \b, followed by an uppercase letter ([A-Z]), and will use lookahead assertion (?=[A-Z\b]) to ensure that it is followed by a uppercase letter or a word boundary without consuming any characters.

To use this regex in a programming language like Python to insert periods, you can use a function like this:

import re

def insert_periods(text):
    def replace(match):
        # This function will be called for each match
        initial = match.group(0)
        return initial + '.' if initial else initial

    # Regular expression to match uppercase initials without a period after them
    pattern = r'\b([A-Z])(?=[A-Z\b])\b'

    # Replace each match in the text with the initial followed by a period
    return re.sub(pattern, replace, text)

# Test cases
print(insert_periods("RR"))   

*** I didn't allow for a longer answer so I asked GPT-4 to continue ***

Certainly! I will complete the Python code snippet for you and show how it works with the test cases:

```python
import re

def insert_periods(text):
    def replace(match):
        # This function will be called for each match
        initial = match.group()
        return initial + '.' if initial else initial

    # Regular expression to match standalone uppercase initials
    pattern = r'\b([A-Z])(?=[A-Z\b])\b'

    # Replace each standalone uppercase initial with the initial followed by a period
    return re.sub(pattern, replace, text)

# Test cases
print(insert_periods("RR"))         # "R. R."
print(insert_periods("R R"))        # "R R"
print(insert_periods("RRe"))        # "RRe"
print(insert_periods("Edward RR"))  # "Edward R. R."
print(insert_periods("ÉÉ"))         # "É.É."

With this code, you should see the following output:

R. R.
R R
RRe
Edward R. R.
É.É.

This meets the criteria you specified in your examples. The regex \b([A-Z])(?=[A-Z\b])\b ensures that only standalone uppercase letters that are followed by another uppercase letter or a word boundary are selected. The lookahead assertion (?=[A-Z\b]) checks for the next uppercase letter or a word boundary without actually consuming any characters, which allows for each standalone letter to be processed separately.

I see it’s only looking at A-Z even though I give it your É example in the question. We don’t have to worry about AI yet.

Robert_Livingston · March 28, 2024, 3:34am

In something like BBEdit:

Find: \b([A-Z])([A-Z])\b
Replace: \1.\2.

The above works for simple ASCII capital letters.

The problem is ÉÉ

The below will solve for the specific non-Ascii É letter but plausibly you would also want other capitalized letters with accents to “work”.

Find: \b([A-ZÉ])([A-ZÉ])\b
Replace: \1.\2.

What if you wanted ÓÓ to do the same? You would have to add these individual accented capitalized letters. Perhaps there are few enough of these in your problem text that this can be done, but it does not feel like a very general solution.

Jonathan_Ashwell · March 28, 2024, 11:24am

Thank you all very much for your examples. The problem seems to be that accented characters won’t match with these (the É I gave was just an example, it needs be generalized). Looking online a bit more it seems that Unicode patterns might work, something I’d have to explore. In any case, I appreciate the time you took to investigate this and the detailed answers you gave. It may be simpler in the end to do this in code, marching through the string and testing each letter for case. It looks like Kem is safe from the AIs…for now.

Kem_Tekinay · March 28, 2024, 11:51am

There are quite a few regular expression engines out there, and while I don’t know them all, it’s hard to imagine any more capable than PCRE, the one built into Xojo.

PCRE offers Unicode scripts that will do what you want, but (as I just discovered) does not play well with standard tokens like \b (word boundary). With that in mind, I came up with this:

rx.SearchPattern = "(?<=^|\s)(\p{Lu})(\p{Lu})(?=$|\s)"
rx.ReplacementPattern = "$1. $2."

This looks for two consecutive uppercase letters (\p{Lu}) that are preceded by either the start of line or some whitespace and followed by the end of line or some whitespace. It captures each letter into a group.

The replacement pattern replaces those with “X. X.”.

The code output from RegExRX for your convenience:

dim rx as new RegEx
rx.SearchPattern = "(?<=^|\s)(\p{Lu})(\p{Lu})(?=$|\s)"
rx.ReplacementPattern = "$1. $2."

dim rxOptions as RegExOptions = rx.Options
rxOptions.ReplaceAllMatches = true

dim replacedText as string = rx.Replace( sourceText )

Craig_Boyd · March 28, 2024, 12:14pm

Qapla’!

Jonathan_Ashwell · March 28, 2024, 6:43pm

@Kem_Tekinay Thank you very much. I was playing in RegExRX with \p{Lu} and made the same discovery that \b did not work. Your solution does indeed work, it’s clever. But I think I didn’t explain quite enough and my examples weren’t sufficiently diverse. I need a general solution that covers an arbitrary number of consecutive initials. Realistically, it would probably range from 1 to 3, but in some countries there may be complex names that, when “initialized”, could have even more. The example is hardcoded for two initials. I’ll use it as a starting point, and if I figure it out I’ll post back here.

Kem_Tekinay · March 28, 2024, 7:19pm

You can’t do that entirely in regular expressions, but you could in code.

Start with identifying all the items that need to be replaced and stuffing them into a Dictionary. The pattern would be:

(?<=^|\s)\p{Lu}{2,}(?=$|\s)

Cycle through the Dictionary keys and replace the text with the initials formatted as you’d like.

for each key as string in matchDict.Keys
  var initials as string = String.FromArray( key.Split(""), ". " ) + "."

  rx.SearchPattern = "(?<=^|\s)" + key + "(?=$|\s)"
  rx.ReplacementPattern = initials
  data = rx.Replace( data )
next

(Not tested.)

With slight modification, you could use a Set instead of a Dictionary. Or you could use an array if you don’t mind duplicates.