Levenshtein Distance Algorithm

Public Function levenshtein(a as string,b as string) as integer
  Dim i As Integer
  Dim j As Integer
  Dim cost As Integer
  Dim d(-1,-1) As Integer
   ' Avoid calculations where there there are empty words
  If Len(a) = 0 Then Return Len(b)
  If Len(b) = 0 Then Return Len(a)
   ' Array initialization    
  Redim d(Len(a),Len(b))
  For i = 0 To Len(a)
     d(i,0) = i
  For j = 0 To Len(b)
     d(0,j) = j
   ' Actual calculation
  For i = 1 To Len(a)
    For j = 1 To Len(b)
      If Mid(a,i,1) = Mid(b,j,1) Then 
        cost = 0
        cost = 1 
      End If
    Next j
  Next i
  Return d(Len(a),Len(b))
End Function

msgbox str(levenshtein(“S”,“S”)) = 0
msgbox str(levenshtein(“S”,“S”)) = 1
msgbox str(levenshtein(“S”,“FRED”)) = 7
so even here there isn’t a lot of granulatirty

Len is actually kind of expensive. For short strings it probably won’t matter, but for longer ones, you might feel it. Instead, I recommend caching Len or perhaps splitting each string into arrays first.

That was a direct translation from a VBScript version… I agree it could be optimized ALOT

Len in Xojo is no op, it does not have to walk to the end of the string and look for NULL terminator like in c string.

No, but (my guess is) it has to account for multi-byte characters each time. Unless they’ve changed it behind the scenes recently, it was definitely slower for large strings, but I haven’t tested in the last couple of years.

that doen’t seem to be a true statement…
I timed taking the LEN(s) where s got progressively longer, and the time to calculate it… got progressively longer

a Million calculations for a string 9k long was 69.7 seconds for 10k it was 77.9,

my bad… .mistake in test code

1000 char:91 sec.
2000 char:88.5 sec.
3000 char:87.66667 sec.
4000 char:87 sec.
5000 char:86.6 sec.
6000 char:86.16667 sec.
7000 char:85.71429 sec.
8000 char:85.375 sec.
9000 char:85.22222 sec.
10000 char:85.1 sec.

almost all identical :slight_smile:

No you are right it is not true, I was thinking it from the Plugin SDK but thats the raw bytes that has counter. So I am guessing that LenB would be no op while Len not.

Nothing is “no-op”… the operation still takes time to execute… its just that the amount of time is not dependant on the length of the string (I tested with single byte data)…

LenB just lookups, Len has to walk for UTF-8 to count the characters.