Levenshtein Distance Algorithm

DaveS · August 22, 2018, 7:17pm

Public Function levenshtein(a as string,b as string) as integer
  Dim i As Integer
  Dim j As Integer
  Dim cost As Integer
  Dim d(-1,-1) As Integer
  
   ' Avoid calculations where there there are empty words
  If Len(a) = 0 Then Return Len(b)
  If Len(b) = 0 Then Return Len(a)
  
   ' Array initialization    
  Redim d(Len(a),Len(b))
  
  For i = 0 To Len(a)
     d(i,0) = i
   Next
  For j = 0 To Len(b)
     d(0,j) = j
  Next
  
   ' Actual calculation
  For i = 1 To Len(a)
    For j = 1 To Len(b)
      If Mid(a,i,1) = Mid(b,j,1) Then 
        cost = 0
       Else 
        cost = 1 
      End If
      d(i,j)=Min((d(i-1,j)+1),(d(i,j-1)+1),(d(i-1,j-1)+cost))
    Next j
  Next i
  
  Return d(Len(a),Len(b))
  
End Function

msgbox str(levenshtein(“S”,“S”)) = 0
msgbox str(levenshtein(“S”,“S”)) = 1
msgbox str(levenshtein(“S”,“FRED”)) = 7
so even here there isn’t a lot of granulatirty

Kem_Tekinay · August 22, 2018, 7:21pm

Len is actually kind of expensive. For short strings it probably won’t matter, but for longer ones, you might feel it. Instead, I recommend caching Len or perhaps splitting each string into arrays first.

DaveS · August 22, 2018, 7:23pm

That was a direct translation from a VBScript version… I agree it could be optimized ALOT

Björn_Eiríksson · August 23, 2018, 1:24pm

Len in Xojo is no op, it does not have to walk to the end of the string and look for NULL terminator like in c string.

Kem_Tekinay · August 23, 2018, 1:33pm

No, but (my guess is) it has to account for multi-byte characters each time. Unless they’ve changed it behind the scenes recently, it was definitely slower for large strings, but I haven’t tested in the last couple of years.

DaveS · August 23, 2018, 2:26pm

that doen’t seem to be a true statement…
I timed taking the LEN(s) where s got progressively longer, and the time to calculate it… got progressively longer

a Million calculations for a string 9k long was 69.7 seconds for 10k it was 77.9,

DaveS · August 23, 2018, 2:36pm

my bad… .mistake in test code

1000 char:91 sec.
2000 char:88.5 sec.
3000 char:87.66667 sec.
4000 char:87 sec.
5000 char:86.6 sec.
6000 char:86.16667 sec.
7000 char:85.71429 sec.
8000 char:85.375 sec.
9000 char:85.22222 sec.
10000 char:85.1 sec.

almost all identical

Björn_Eiríksson · August 23, 2018, 3:04pm

No you are right it is not true, I was thinking it from the Plugin SDK but thats the raw bytes that has counter. So I am guessing that LenB would be no op while Len not.

DaveS · August 23, 2018, 3:10pm

Nothing is “no-op”… the operation still takes time to execute… its just that the amount of time is not dependant on the length of the string (I tested with single byte data)…

Christian_Schmitz · August 23, 2018, 4:27pm

LenB just lookups, Len has to walk for UTF-8 to count the characters.