RegEx: getting identical matches just once?

Hi all,

I have text like this:

3338889 eIAQDFkTDLR N-Term(Acetyl); K7(Propionyl)
181532552 kQLATkAAR N-Term(Acetyl); K1(Acetyl); K6(Acetyl)
25539186 kQLATkAAR N-Term(Acetyl); K1(Propionyl); K6(Acetyl)
2411041777 kQLATkAAR N-Term(Acetyl); K6(Acetyl)
25759934 kQLATkAAR N-Term(Propionyl); K6(Acetyl)

and I’m getting the text in brackets with this RegEx: (?U:\(.*\))

I can turn the RegExMatch into a list of used values

Acetyl
Propionyl

but is there a way to have RegEx return such a list directly?

TIA

Markus

I would create a method for this.

Creating a method just moves the problem to a different place.

It is easy enough to do with a dictionary, but the question is if RegEx can’t do it directly.

After all it seems just the thing RegEx would be for (imagine making an index of words in a book etc).

It’s not really moving the problem to a different place, it is adding functionality to RegEx that you will use. Make it generic and then reuse it everywhere. For example, create a module named “RegExHelpers” and then add this method to it:

Function FindAll(Extends r As RegEx, TargetString As String) As String()
  Dim matches() As String
  Dim match As RegExMatch = r.Search(TargetString)
  
  Do
    If match <> Nil Then
      matches.Append match.SubExpressionString(0)
    End If
    
    match = r.Search
  Loop Until match Is Nil
  
  Return matches
End Function

Now, in your app, simply do:

Dim r As New RegEx
r.SearchPattern = "(?U:\(.*\))"
Dim matches() As String = r.FindAll(theText)

You can of course make as many of these helpers as you’d like, populate the array w/match objects instead of strings, etc…

Can’t edit… Forgot to mention that all “Extends” methods in a module have to be marked as Global. It’s not really a global method, as it is callable only as an extension of an already instantiated RegEx object.

Well, it’s really a “Globally” callable method from anywhere you place and handle a RegEx. :stuck_out_tongue:

A little OCD optimization:

Function FindAll(Extends r As RegEx, TargetString As String) As String()
  Dim matches() As String
  Dim match As RegExMatch = r.Search(TargetString)
  
  While match <> Nil
    matches.Append match.SubExpressionString(0)
    match = r.Search
  Wend
  
  Return matches
End Function

Didn’t even think about the loop, just copied from language manual. Good call.

@Paul Lefebvre : Maybe update the language manual, http://documentation.xojo.com/index.php/RegEx.Search … Rick’s loop is better in many ways including readability.

[quote=116831:@Markus Winter]
but is there a way to have RegEx return such a list directly?
[/quote]
What everyone’s saying is: no, not directly from the RegEx engine. You need a loop.

[quote=116856:@Jeremy Cowgar]
Dim matches() As String
[/quote]

Thanks, but an array won’t do what I asked it to do - you’ll need a dictionary instead as I mentioned above.

Thanks Norman - that’s what I suspected.

Well, apart from the subtle “in one step directly using RegEx” part, it did exactly what you asked.
But I suspected you wished to ask something different.

You asked for a list of used values using RegEx.

But because you mention using Dictionaries instead of Arrays, I suspect you want a list of used values using RegEx with no repetition.

You might have overlooked the example in the original post :wink:

For some reason I am not seeing an example of your desired output; I only see a request for a list (array) of used values:

Acetyl
Propionyl

If you have not figured it out, please clarify what you wish to do and we’ll try to help accomplish that goal.

I did, in both the example and the thread title: “RegEx: getting identical matches just once?”

As Norman said it is not possible in pure RegEx, so using a dictionary like I already did was the way to go.

I’m just a bit surprised that RegEx has no way of directly doing this.
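
For reference, the dictionary approach mentioned above might look roughly like the following. This is only a sketch: FindDistinct is a hypothetical name, not an existing RegEx method, and it assumes the helper sits in a module and is marked Global, like FindAll. It returns each distinct match once, in the order it is first seen.

Function FindDistinct(Extends r As RegEx, TargetString As String) As String()
  // Hypothetical helper: collect every match, but keep each value only once.
  Dim seen As New Dictionary   // keys are the values already collected
  Dim distinct() As String
  Dim match As RegExMatch = r.Search(TargetString)
  
  While match <> Nil
    Dim v As String = match.SubExpressionString(0)
    If Not seen.HasKey(v) Then
      seen.Value(v) = True     // remember the value so repeats are skipped
      distinct.Append v
    End If
    match = r.Search           // continue from the end of the previous match
  Wend
  
  Return distinct
End Function

The loop is still there, of course; the dictionary only decides whether a match has been seen before.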

Oh, now I understand what you want! I thought you meant by just one method call, not the text just once, my bad.

Again, I would make things generic, you will likely find other uses for a method such as this. Doing it in a dictionary is likely to be costly, because of key lookups taking place each time you would add a new value. I do not have large amounts of text to benchmark against, but here is a little method that you can change and rename to your needs, and again use on any array…

Function CountUnique(list() As String) As Pair()
  Dim results() As Pair
  
  // Catch a special condition where only 1 item exists. This is in place
  // because it will run only once per call, other methods of looping through
  // data would require an If to be inserted into the main for loop executed
  // n times.
  If list.Ubound = 0 Then
    results.Append list(0) : 1
  End If
  
  // This will return on an empty list or on a list of 1
  If list.Ubound < 1 Then
    Return results
  End If
  
  list.Sort
  list.Append "" // End Of List Marker, simply triggers a change on 'current'
  
  Dim current As String = list(0)
  Dim count As Integer = 1
  
  For i As Integer = 1 To list.Ubound
    If list(i) <> current Then
      results.Append current : count
      count = 1
      current = list(i)
    Else
      count = count + 1
    End If
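  Next
  
  // Arrays are passed by reference, so remove the marker again to avoid
  // leaving an extra empty element in the caller's array (which stays sorted).
  list.Remove(list.Ubound)
  
  Return results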
End Function

Then in your main code, you can do:

  Dim r As New RegEx
  r.SearchPattern = "(?U:\(.*\))"
  
  Dim list() As String = r.FindAll(text)
  Dim counts() As Pair = CountUnique(list)
  
  For Each p As Pair In counts
    Print Str(p.Left) + " occurred " + Str(p.Right) + " times"
  Next

Here is a FindUnique (mark Global and place in the RegExHelpers module) that uses a Dictionary that you can benchmark. In my tests (with your sample text duplicated quite a few times) I see no noticeable difference between a Dictionary and an Array, but the text only contains two variants. I do not know how many possible variants there are, so best to test w/real data.

Function FindUnique(Extends r As RegEx, TargetString As String) As Pair()
  Dim matches As New Dictionary
  Dim match As RegExMatch = r.Search(TargetString)
  
  While match <> Nil
    Dim v As String = match.SubExpressionString(0)
    matches.Value(v) = matches.Lookup(v, 0) + 1
    match = r.Search
  Wend
  
  Dim results() As Pair
  ReDim results(matches.Count - 1)
  
  For i As Integer = 0 To results.Ubound
    Dim key As String = matches.Key(i)
    results(i) = key : matches.Value(key)
  Next
  
  Return results
End Function

I’m a little late to this party, and admittedly didn’t read every post here. Having said that, based on the OP, this will do what Markus wants by locating only the last occurrence of each matching string.

dim rx as new RegEx
rx.SearchPattern = "(?msi-U)\(([^)]+)\)(?!.*\(\g1\))"

dim  matches() as string
dim match as RegExMatch = rx.Search( sourceText )
while match <> nil
  matches.Append match.SubExpressionString( 1 )
  match = rx.Search()
wend

It matches the string and puts it into subgroup 1, then uses a negative lookahead to make sure the same string doesn’t occur later. The mode (?s) is the same as DotMatchesAll.
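
As a rough check of how the lookahead behaves, here is the same pattern run against just the first and last lines of the sample from the original post. The sourceText variable is built here purely for illustration. Every “(Acetyl)” and “(Propionyl)” that is followed later by the same bracketed text is rejected, so only the final occurrence of each survives.

dim sourceText as string = _
  "3338889 eIAQDFkTDLR N-Term(Acetyl); K7(Propionyl)" + EndOfLine + _
  "25759934 kQLATkAAR N-Term(Propionyl); K6(Acetyl)"

dim rx as new RegEx
rx.SearchPattern = "(?msi-U)\(([^)]+)\)(?!.*\(\g1\))"

dim matches() as string
dim match as RegExMatch = rx.Search( sourceText )
while match <> nil
  // Subgroup 1 is the text inside the brackets, without the brackets
  matches.Append match.SubExpressionString( 1 )
  match = rx.Search()
wend

// matches now holds "Propionyl" and "Acetyl", in that order:
// both first-line occurrences are skipped because each reappears on the second line

Note that the results come back in the order of their last occurrence, not their first, which may or may not matter for your use.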

Thanks Jeremy - appreciate the help!

When it comes to RegEx, Kem makes the impossible possible :slight_smile:

While I do have your excellent RegExRX, I’m not sure I would have gotten that in my lifetime …

Thanks everyone!