String parsing help

Greetings,

I have some issues detecting certain strings, and apparently String.Uppercase and String.TitleCase give the same result, which is confusing.

In my case, the names are in the format “FAMILYNAME Firstname”, and I need to separate the family name, which is always in capital letters, from the first name. There can be one or more first names, and they can be title case or lowercase, but never all capitals.

The family name can also consist of one or more words.

I tried a lot of code and regex, and it always fails. Can someone shed some light here, please?

Thanks

Somehow the characteristic character sequence for the separation is:
upper, space, upper, lower case

NAME First
UUUUSULLLL

Which genius put this into one string?
“FAMILYNAME Firstname”

I think this pattern would do it:

(?-i)^([A-Z]+) (.+)

Assuming you won’t get spaces in the FAMILYNAME, e.g., “FAMILY NAME”, this will capture the first string that is all caps into SubExpressionString( 1 ) and the remaining characters after a space into SubExpressionString( 2 ).

I’d need more examples to refine this.

Unfortunately you will get those spaces. So far I have managed to fix it; not sure if it is the best way, but it does the job:

Private Function SplitFullName(fullName As String) As Dictionary
  // Create a dictionary to store the first name and last name
  Var result As New Dictionary
  result.Value("FirstName") = ""
  result.Value("LastName") = ""
  
  Try
    // Debug: Log the input
    System.DebugLog("Input FullName: " + fullName)
    
    // Split the full name into parts by spaces
    Var nameParts() As String = fullName.Split(" ")
    Var lastNameParts() As String
    Var firstNameParts() As String
    
    // Define regex patterns for classification
    Var rxLowerCase As New RegEx
    Var ro As New RegExOptions
    ro.CaseSensitive = True
    
    rxLowerCase.Options = ro
    rxLowerCase.SearchPattern = "^[a-z]+" // Matches words that start with a lowercase letter
    
    Var rxTitleCase As New RegEx
    rxTitleCase.Options = ro
    
    rxTitleCase.SearchPattern = "[A-Z][a-z]+" // Matches an uppercase letter followed by lowercase letters, anywhere in the word
    
    Var rxUpperCase As New RegEx
    rxUpperCase.Options = ro
    
    rxUpperCase.SearchPattern = "^[A-Z]+" // Matches words that start with an uppercase letter; checked last, after the title-case test
    
    // Iterate through each part and classify
    For Each part As String In nameParts
      System.DebugLog("Processing Part: " + part)
      If rxLowerCase.Search(part) <> Nil Then
        // Fully lowercase -> FirstName
        System.DebugLog("Matched as LowerCase: " + part)
        firstNameParts.Add(part)
      ElseIf rxTitleCase.Search(part) <> Nil Then
        // TitleCase -> FirstName
        System.DebugLog("Matched as TitleCase: " + part)
        firstNameParts.Add(part)
      ElseIf rxUpperCase.Search(part) <> Nil Then
        // Fully uppercase -> LastName
        System.DebugLog("Matched as UpperCase: " + part)
        lastNameParts.Add(part)
      Else
        // Debug: Log unclassified parts
        System.DebugLog("Unclassified Part: " + part)
      End If
    Next
    
    // Reconstruct the first name and last name from the arrays
    result.Value("FirstName") = String.FromArray(firstNameParts, " ").Trim
    result.Value("LastName") = String.FromArray(lastNameParts, " ").Trim
  Catch e As RuntimeException
    System.DebugLog("Error: " + e.Message)
    // If something goes wrong, fallback to treating the full name as the first name
    result.Value("FirstName") = fullName
  End Try
  
  // Debug: Output results
  System.DebugLog("Final Parsed FirstName: " + result.Value("FirstName"))
  System.DebugLog("Final Parsed LastName: " + result.Value("LastName"))
  
  Return result
End Function

This is basically the working code. If it can be done better, ideas are more than welcome.

What I understand is that something like this is possible:

DE LA TORRE Jose Armando

FAMILYNAME: DE LA TORRE
Firstname (with middle name): Jose Armando

Maybe it can, but again, I’d need more and better examples of what we’d be matching against.

That is exactly the type of example I could have:

DE LA TORRE Jose Armando

No idea who entered the data this way; I guess it is messed up, but so far that code did the job.

UUSUUSUUUUUSULLLSULLLLLL

where the third S (between UUUUU and ULLL) is the split

Ah, this should do it:

(?-i)^([A-Z]+(?: [A-Z]+)*) (.+)

Again, Family Name will be in SubExpressionString( 1 ) and First Name in SubExpressionString( 2 ). This assumes there will always be both.
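For anyone who wants to try it, here is a minimal sketch of how that pattern could be applied with Xojo's RegEx class. The variable names and the sample input are only illustrative, and it assumes the API 2.0 syntax (Var) used elsewhere in this thread.

```xojo
Var rx As New RegEx
rx.SearchPattern = "(?-i)^([A-Z]+(?: [A-Z]+)*) (.+)"

// Sample input taken from the example above
Var match As RegExMatch = rx.Search("DE LA TORRE Jose Armando")
If match <> Nil Then
  Var familyName As String = match.SubExpressionString(1) // the leading all-caps words
  Var firstName As String = match.SubExpressionString(2)  // everything after them
  System.DebugLog("Family: " + familyName + ", First: " + firstName)
Else
  // No match: the input did not start with an all-caps word followed by a space
  System.DebugLog("No match")
End If
```

Search returns Nil when nothing matches, so the Nil check covers inputs that have no all-caps family name or no first name.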


https://regex101.com/

Special characters that don't match A-Z (e.g. the German Ä) could break this, at least in this online test form.
So could double spaces.
DE LÃ TORRE Jose Armando

Good point. This should take care of both of those issues:

(?-i)^(\p{Lu}+(?: +\p{Lu}+)*) +(.+)
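If it helps, this pattern could be wrapped in a small reusable function along these lines. This is only a sketch: the name SplitName and its ByRef parameters are hypothetical, and trimming the input first is an added assumption so that leading or trailing spaces don't prevent a match.

```xojo
// Hypothetical helper: returns True and fills the ByRef parameters on a match
Function SplitName(fullName As String, ByRef familyName As String, ByRef firstName As String) As Boolean
  Var rx As New RegEx
  rx.SearchPattern = "(?-i)^(\p{Lu}+(?: +\p{Lu}+)*) +(.+)"
  
  // Trim is an assumption, not part of the pattern above
  Var match As RegExMatch = rx.Search(fullName.Trim)
  If match = Nil Then Return False
  
  familyName = match.SubExpressionString(1) // uppercase words, including non-ASCII capitals
  firstName = match.SubExpressionString(2)  // the rest of the string
  Return True
End Function
```

Note that any double spaces inside the family name are kept as they appear in the input; if that matters, they would need to be collapsed separately.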

If you didn’t want to use RegEx, you could split the line into words, then test each word’s characters. Assume a word is all caps until a character with a code above that of “Z” is found. Gather the successes in one string and the failures in another.

I’m sure this can be optimized…

Var vName As String = "THIS IS A Test of First and LAST Names"
Var aWords() As String = vName.Split(" ")
Var aChars() As String

Var firstN, lastN As String = ""
Var hasLC As Boolean

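For i As Integer = 0 To aWords.LastIndex
  aChars = aWords(i).Split("") // split the word into single characters
  hasLC = False
  For j As Integer = 0 To aChars.LastIndex
    // Any character code above 90 ("Z") counts as "not a capital"
    If aChars(j).Asc > 90 Then
      hasLC = True
      Exit
    End If
  Next
  If hasLC Then
    firstN = firstN + " " + aWords(i)
  Else
    lastN = lastN + " " + aWords(i)
  End If
Next

// Trim removes the leading space added while joining the words
MessageBox("Last=" + lastN.Trim + ",  First=" + firstN.Trim)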