Hang in regex

I have a really weird hang doing a regex. The regex itself is relatively simple:

\b(https?://|www\.)([^<>\s]+)

It is supposed to turn URLs into links. I have a really old email (ironically from @Kem_Tekinay) where the second match starts before the search start position that was set after the first match. The result is an infinite loop because matching never finishes.

The code isn’t really new. Does anyone see what I’m doing wrong?

'make urls to links

Dim theRegex As New RegExMBS
theRegex.CompileOptionCaseLess = True
theRegex.CompileOptionDotAll = True
theRegex.CompileOptionUngreedy = False
theRegex.CompileOptionNewLineAny = True
Dim searchString As String = "\b(https?://|www\.)([^<>\s]+)"

If theRegex.Compile(searchString) Then
  Dim searchStart As Integer
  Dim punctuation As String
  Dim protocol As String
  Dim url As String
  Dim replacement As String
  
  While theRegex.Execute(theText, searchStart) > 0
    ' Get match offsets
    Dim matchStart As Integer = theRegex.OffsetCharacters(0)
    Dim matchEnd As Integer = theRegex.OffsetCharacters(1)
    
    ' Extract submatches using offsets
    Dim protocolStart As Integer = theRegex.OffsetCharacters(2)
    Dim protocolEnd As Integer = theRegex.OffsetCharacters(3)
    protocol = theText.Middle(protocolStart, protocolEnd - protocolStart)
    
    Dim urlStart As Integer = theRegex.OffsetCharacters(4)
    Dim urlEnd As Integer = theRegex.OffsetCharacters(5)
    url = theText.Middle(urlStart, urlEnd - urlStart)
    
    ' Remove punctuation at the end of the URL
    Select Case url.Right(1)
    Case ")"
      url = url.TrimRight(")")
      punctuation = ")"
    Case "."
      url = url.TrimRight(".")
      punctuation = "."
    Case ","
      url = url.TrimRight(",")
      punctuation = ","
    Case "?"
      url = url.TrimRight("?")
      punctuation = "?"
    Else
      punctuation = ""
    End Select
    
    ' Add https and change http to https
    If protocol = "www." Then
      protocol = "https://www."
    ElseIf protocol = "http://" Then
      protocol = "https://"
    End If
    
    ' Create the replacement string
    replacement = "<a href=""" + protocol + url + """>" + protocol + url + "</a>" + punctuation
    
    ' Replace the match in the text
    theText = theText.Left(matchStart) + replacement + theText.Middle(matchEnd)
    
    ' Adjust search start position for the next match
    searchStart = matchStart + replacement.Length
  Wend
End If
beep

Example:
regex hang.xojo_binary_project.zip (7.0 KB)

The example has 2 versions, one for MBS and one for the regular Xojo regex. If you don’t have MBS then comment out the MBS version.

Update: Please ignore this post, I was wrong. See the answers from Kem.

Thanks for the sample project!
I don't think there is any need to set searchStart, and I'm not sure replacement.Bytes is a good idea, since the string is UTF-8. Maybe this works for you?
regex hang_thk.xojo_binary_project.zip (6.6 KB)

I find it puzzling. It is as if the code ignores the searchStart + 1 argument in this line of code:
matches = theRegex.Search(theText, searchStart + 1)

In the documentation for Search As RegExMatch

If you call Search with a TargetString and omit SearchStartPosition, zero is assumed. If you call Search with no parameters after initially passing a TargetString, it assumes the previous TargetString and will begin the search where it left off in the previous call. This is the easiest way to find the next occurrence of SearchPattern in TargetString.

So if you decide to use the “easiest way”, after the first pass through the While/Wend loop just use: matches = theRegex.Search()

Var counter As Integer
While searchStart < theText.Length

  // matches = theRegex.Search(theText, searchStart + 1)
  If counter = 0 Then
    matches = theRegex.Search(theText, searchStart + 1)
  Else
    matches = theRegex.Search()
  End If
 
  counter = counter + 1

The problem is, the text is changing after each search. Search without parameters doesn’t account for that.
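To illustrate the point, here is a minimal sketch in Python (not Xojo, so that it can be run anywhere). It assumes a simplified replacement that only wraps the URL in an anchor tag; the function name apply_links and the sample text are made up for illustration. Offsets recorded against the original text go stale as soon as the first replacement lengthens the string, unless they are shifted by the accumulated growth.

```python
import re

# Same pattern as in the question.
URL = re.compile(r"\b(https?://|www\.)([^<>\s]+)", re.IGNORECASE)

def apply_links(text, spans, adjust):
    """Replace each (start, end) span with an anchor tag.

    spans were measured on the ORIGINAL text. With adjust=False they are
    used as-is, even though earlier replacements have moved everything.
    """
    shift = 0
    for start, end in spans:
        if adjust:
            start, end = start + shift, end + shift
        url = text[start:end]
        link = '<a href="' + url + '">' + url + '</a>'
        text = text[:start] + link + text[end:]
        shift += len(link) - len(url)  # how much the text grew
    return text

text = "see http://a.example and http://b.example now"
spans = [m.span() for m in URL.finditer(text)]

stale = apply_links(text, spans, adjust=False)  # second span points into the first link
fixed = apply_links(text, spans, adjust=True)   # offsets shifted by the growth so far
```

Here stale ends up with the second replacement spliced into the middle of the first anchor, while fixed wraps both URLs cleanly.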

I like your solution. But I remain puzzled why the logic of the OP does not work using the searchStart.

Ahh. That explains it…

I can’t get the native version to hang. But the MBS version is hanging because it keeps replacing the same text over and over, just in a different spot.

So:

http://my.url.com

becomes

<a href="https://my.url.com">https://my.url.com</a>

but then the search starts again at the now pushed over url, so that becomes

<a href="https://my.url.com"><a href="https://my.url.com">https://my.url.com</a></a>

and so on. I haven’t tracked why that is.
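The nesting can be reproduced outside Xojo. The sketch below is Python and makes one loud assumption: the next search start falls short of the end of the replacement by some number of positions (the hypothetical shortfall parameter). It does not claim that this is what happens inside MBS; it only shows that any start position landing before the visible link text re-matches that text. The loop is capped with max_rounds so that it terminates.

```python
import re

URL = re.compile(r"\b(https?://|www\.)([^<>\s]+)", re.IGNORECASE)

def linkify_once_per_round(text, shortfall, max_rounds=5):
    """Replace one URL per round, as the loop in the question does.

    shortfall is a made-up knob: how far the next search start lands
    before the end of the replacement that was just inserted.
    """
    start = 0
    rounds = 0
    while rounds < max_rounds:
        m = URL.search(text, start)
        if m is None:
            break
        url = m.group(0)
        link = '<a href="' + url + '">' + url + '</a>'
        text = text[:m.start()] + link + text[m.end():]
        start = max(0, m.start() + len(link) - shortfall)
        rounds += 1
    return text, rounds

ok, ok_rounds = linkify_once_per_round("http://my.url.com", shortfall=0)
nested, nested_rounds = linkify_once_per_round("http://my.url.com", shortfall=25)
```

With shortfall=0 the loop stops after one replacement. With shortfall=25 every round finds the URL between the tags again and wraps it once more, so only the cap ends the loop.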

In the native version, use Bytes rather than Length in the while loop since RegEx deals with binary positions.
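A small Python sketch of why Bytes and Length can disagree, assuming UTF-8 text and an engine that reports byte positions (as the note above says for the native RegEx). The helper first_url_offsets and the sample strings are made up for illustration. Any non-ASCII character before the match makes the byte offset larger than the character offset, so a start position computed from character counts falls short.

```python
import re

# Byte-oriented version of the pattern from the question.
URL_B = re.compile(rb"\b(https?://|www\.)([^<>\s]+)", re.IGNORECASE)

def first_url_offsets(text):
    """Return (character offset, byte offset) of the first URL in text."""
    data = text.encode("utf-8")
    m = URL_B.search(data)
    byte_start = m.start()
    # Decode the prefix to count characters instead of bytes.
    char_start = len(data[:byte_start].decode("utf-8"))
    return char_start, byte_start

ascii_only = first_url_offsets("see http://my.url.com")
with_umlauts = first_url_offsets("Grüße, see http://my.url.com")
```

For the ASCII-only text both offsets are the same. In the second text ü and ß take two bytes each, so the byte offset is two positions ahead of the character offset, and the gap grows with every further non-ASCII character.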

An alternate strategy is to scan the text first and store each match as a key in a Dictionary, then set the replacement as the value for each key, then cycle through the Dictionary using theText = theText.ReplaceAll( key, dict.Value( key ) ).
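A sketch of the scan-first idea in Python. Note that it is a variant, not the Dictionary plus ReplaceAll approach itself: it collects all matches on the unmodified text and then rebuilds the string once from the pieces, so replaced text is never searched again and no start position has to be maintained. The punctuation and protocol handling loosely follows the code in the question; the names make_link and linkify are made up.

```python
import re

URL = re.compile(r"\b(https?://|www\.)([^<>\s]+)", re.IGNORECASE)

def make_link(match):
    """Build the anchor tag for one match, moving trailing punctuation outside."""
    protocol, url = match.group(1).lower(), match.group(2)
    punctuation = ""
    if url[-1] in ").,?":
        trimmed = url.rstrip(url[-1])
        punctuation = url[len(trimmed):]
        url = trimmed
    # Add https and change http to https, as in the question.
    if protocol == "www.":
        protocol = "https://www."
    elif protocol == "http://":
        protocol = "https://"
    target = protocol + url
    return '<a href="' + target + '">' + target + '</a>' + punctuation

def linkify(text):
    """Scan once, then assemble the result from untouched text and links."""
    matches = list(URL.finditer(text))  # scan pass, text is not modified yet
    pieces = []
    last = 0
    for m in matches:
        pieces.append(text[last:m.start()])
        pieces.append(make_link(m))
        last = m.end()
    pieces.append(text[last:])
    return "".join(pieces)

result = linkify("Visit http://my.url.com, or www.example.org.")
```

Because every offset refers to the original text, it does not matter whether the engine counts bytes or characters, as long as the same unit is used for slicing.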

This happens for me with both the native and the MBS versions. My next step will be to scan first and then do the replacements.