Speed-Challenge

Good Morning! Let’s make a competition! I am trying to improve this little function:


function WordEqualityInPercent(word1 as string, word2 as String) as integer
  // Compares two string and showing as result a percent number how much they are equal
  
  dim result as integer = 0
  dim isequal as integer = 0
  dim i as integer
  
  // compare only if strings with same lenghts
  if word1.len = word2.len then
    
    // quit if empty strings
    if word1 = "" then return 0
  
    // quit if equal
    if word1 = word2 then return 100
  
    // iterate through string counting all differences
    for i = 1 to word1.len
      if word1.mid(i,1) = word2.mid(i,1) then isequal = isequal + 1
    next
    
    // create percent number
    result = (isequal / word1.Len) * 100
    
  end if
  
  return result

end function  

It basically compares two strings char-wise and returns the equality in percentage. On 1000 iterations this function takes 8700 Microseconds (for strings “Hundekuchen” and “Hundefutter”) but for sure this can be done much faster. So here we are, let’s start a challange. The one who provides a much faster solution is the matchwinner. But there is one obstacle: No Xojo Script allowed!

So go forth and create great code… looking forward!

I’ll take a shot at it…

Function WordEqualityInPercent(word1 as string, word2 as String) As integer
  // Compares two string and showing as result a percent number how much they are equal
  
  dim result as integer = 0
  dim isequal as integer = 0
  dim i as integer
  dim w1 As MemoryBlock
  dim w2 As MemoryBlock
  dim w1Size As UInt64
  
  w1 = word1
  w2 = word2
  
  // compare only if strings with same lenghts
  if (w1.Size = w2.Size) and (w1.Size > 0) then
    
    // quit if equal
    if word1 = word2 then return 100
    
    // iterate through string counting all differences
    i = 0
    w1Size = w1.Size
    while i < w1Size
      if w1.UInt8Value(i) = w2.UInt8Value(i) then isequal = isequal + 1
      i = i + 1
    wend
    
    // create percent number
    result = (isequal / w1.Size) * 100
    
  end if
  
  return result
  
End Function

wow Alwyn this is 3xtimes better… so far :wink:
Loop 1 (old func) (1000000 iterations) = 6511667 Microseconds, result = 63%
Loop 2 (your func) (1000000 iterations) = 2091126 Microseconds, result = 63%

Nice… :wink:

No Xojo Script allowed!

why?

[quote=277840:@Alwyn Bester]// quit if equal
if word1 = word2 then return 100[/quote]
Aren’t memory blocks regular objects? If yes, this would be a comparison of the pointers to the two memory blocks. This would always be False.

I think it should be:

If word1.StringValue(0, word1.Size - 1) = word2.StringValue(0, word2.Size - 1) Then Return 100 End

You are right about the memoryblock comparisons, but keep in mind… word1 and word2 are strings. w1 and w2 are the memory blocks.

Haven’t tested it but I think

word1 = word2

should be faster then

w1.StringValue(0, w1.Size - 1) = w2.StringValue(0, w2.Size - 1)

Tomas’ original code is case-insensitive while Alwyn’s is case-sensitive. This can be overcome by converting both to uppercase before converting to a MemoryBlock.

Tomas’ original code will not be thrown off by different encodings while Alwyn’s will. This can be overcome by converting word2 to the same encoding as word1 at the top.

Using a Ptr is usually faster than accessing the MemoryBlock directly. Something like this:

      dim p1 as Ptr = w1
      dim p2 as Ptr = w2
      …
      if p1.Byte(i) = p2.Byte(i) then isequal = isequal + 1

Here is the new method using Kem’s pointers (pardon the pun):

Function WordEqualityInPercent(word1 as string, word2 as String) As integer
  // Compares two string and showing as result a percent number how much they are equal
  
  dim result as integer = 0
  dim isequal as integer = 0
  dim i as integer
  dim w1 As MemoryBlock
  dim w2 As MemoryBlock
  dim w1Size As UInt64
  
  w1 = word1
  w2 = word2
  
  dim p1 as Ptr = w1
  dim p2 as Ptr = w2
  
  // compare only if strings with same lenghts
  if (w1.Size = w2.Size) and (w1.Size > 0) then
    
    // quit if equal
    if word1 = word2 then return 100
    
    // iterate through string counting all differences
    i = 0
    w1Size = w1.Size
    while i < w1Size
      if p1.Byte(i) = p2.Byte(i) then isequal = isequal + 1
      i = i + 1
    wend
    
    // create percent number
    result = (isequal / w1.Size) * 100
    
  end if
  
  return result
  
End Function

Getting an 18% performance increase by using pointers instead of memory blocks.

Second this with this:

Loop 1 (previous function) (1000000 iterations) = 2157292 Microseconds, result = 63%
Loop 2 (latest function) (1000000 iterations) = 1994623 Microseconds, result = 63%

Kem and Alwyn you’re great!

Modifying Alwyn’s code to compare 8 bytes at a time, there can be a significant speed improvement if the two strings are very similar and reasonably large. On the other hand if they are completely different or very small in length, there is only a small performance hit.

[code]Function WordEqualityInPercent(word1 as string, word2 as String) As integer
// Compares two string and showing as result a percent number how much they are equal

dim result as integer = 0
dim isequal as integer = 0
dim i,j as integer
dim w1 As MemoryBlock
dim w2 As MemoryBlock
dim w1Size As UInt64

w1 = word1
w2 = word2

dim p1 as Ptr = w1
dim p2 as Ptr = w2

// compare only if strings with same lenghts
if (w1.Size = w2.Size) and (w1.Size > 0) then

// quit if equal
if word1 = word2 then return 100

// iterate through string counting all differences
i = 0
w1Size = w1.Size
while i < w1Size-7
  if p1.int64(i) = p2.int64(i) then
     isequal = isequal + 8
  else
    for j=i to i+7
      if p1.Byte(j)=p2.Byte(j) then isequal = isequal + 1
    next
  end if
  i = i + 8
wend
for j=i to w1Size-1
  if p1.Byte(j)=p2.Byte(j) then isequal = isequal + 1
next
// create percent number
result = (isequal / w1.Size) * 100

end if

return result

End Function
[/code]

I meant to elaborate that when you have prior knowledge of what the two strings are likely to be like, you can often tailor your compare routines to make them more efficient. If you know that the two strings are likely to be very similar, you can do one thing, and if you know that they are likely to be very different, then you can do something else. Since the question of efficiency has been raised in the first place, I’m assuming that the strings are likely quite long. Otherwise there wouldn’t likely be much point spending a lot of time on optimization.

I struggled with a function like this to compare user inputs to known database values. Initially wrote a function just like these to come up with a percent difference and if it was lower than 3% difference then it would be considered a match, but this was a faulty assumption because of things like:

“Mr. Tom Jones” <> “Mr Tom Jones” <> “Tom Jones”
“Patrick Harris” <> " Pat Harris"
“Noa Jefferson” <> “Noah Jefferson”

When there’s an extra character inserted in there, that completely throws off standard percent comparisons (unless you get clever about it and check for that, but that slows the whole thing down). I ended up modifying the Metaphone algorithm and comparing words based on how they sound phonetically rather than doing direct string comparisons.

Just a suggestion that this may be a better approach when you’re comparing names or words instead of binary data.

Hi Robert, will check your routine when back home (I am on the road right now) but to answer your questions. Why to spent so much time on this: Well, it’s a kind of fuzzy-logic I’ll use in a project. Yes it is likely that only familiar words (to be more precise: only sound-familiar words) will be compared.

You’re pointing in the right direction when talking about Metaphone though I am using “Kölner Phonetic”, a more precise approach for german language. If you like you may check this out here: http://jakobssystems.net/webapps/cgnsearch/cgnsearch.cgi

But even if you pre-select your words phonetically you’ll get a huge amount of results. And to give you hard numbers: German Dictionary has 1.7 Mio words…

Hi Christian, you’re right that Meyer <> Mayer <> Meier - all variations of same name. But this is not the problem anymore when using phonetic search algorithms. If you’re just searching for names this leads to good results. But if you query whole Dictionaries - then you will get a lot of sound-familiar words. Then you need to compare them with each other and based upon the percentage (and other factors) my software chooses one…

you see my WebApp for Kölner Phonetc above… it’s kinda Russel-SoundEx but better for german language. Here you see the results of just same Phonem 1548 … but just one out of 101 is the matching word… guess which one :wink:

This is something that I was wondering about too. If the extra character is inserted at the very beginning of the string, then it will result in a very low or even zero percent match, even though the strings might otherwise be identical. To avoid this kind of problem you could use the same kind of logic used in file compression algorithms, but it would be very slow. In any event I think there is an inevitable tradeoff between speed and accuracy.