Good Morning! Let’s make a competition! I am trying to improve this little function:
function WordEqualityInPercent(word1 as string, word2 as String) as integer
// Compares two string and showing as result a percent number how much they are equal
dim result as integer = 0
dim isequal as integer = 0
dim i as integer
// compare only if strings with same lenghts
if word1.len = word2.len then
// quit if empty strings
if word1 = "" then return 0
// quit if equal
if word1 = word2 then return 100
// iterate through string counting all differences
for i = 1 to word1.len
if word1.mid(i,1) = word2.mid(i,1) then isequal = isequal + 1
next
// create percent number
result = (isequal / word1.Len) * 100
end if
return result
end function
It basically compares two strings char-wise and returns the equality in percentage. On 1000 iterations this function takes 8700 Microseconds (for strings “Hundekuchen” and “Hundefutter”) but for sure this can be done much faster. So here we are, let’s start a challange. The one who provides a much faster solution is the matchwinner. But there is one obstacle: No Xojo Script allowed!
So go forth and create great code… looking forward!
Function WordEqualityInPercent(word1 as string, word2 as String) As integer
// Compares two string and showing as result a percent number how much they are equal
dim result as integer = 0
dim isequal as integer = 0
dim i as integer
dim w1 As MemoryBlock
dim w2 As MemoryBlock
dim w1Size As UInt64
w1 = word1
w2 = word2
// compare only if strings with same lenghts
if (w1.Size = w2.Size) and (w1.Size > 0) then
// quit if equal
if word1 = word2 then return 100
// iterate through string counting all differences
i = 0
w1Size = w1.Size
while i < w1Size
if w1.UInt8Value(i) = w2.UInt8Value(i) then isequal = isequal + 1
i = i + 1
wend
// create percent number
result = (isequal / w1.Size) * 100
end if
return result
End Function
[quote=277840:@Alwyn Bester]// quit if equal
if word1 = word2 then return 100[/quote]
Aren’t memory blocks regular objects? If yes, this would be a comparison of the pointers to the two memory blocks. This would always be False.
I think it should be:
If word1.StringValue(0, word1.Size - 1) = word2.StringValue(0, word2.Size - 1) Then
Return 100
End
Tomas’ original code is case-insensitive while Alwyn’s is case-sensitive. This can be overcome by converting both to uppercase before converting to a MemoryBlock.
Tomas’ original code will not be thrown off by different encodings while Alwyn’s will. This can be overcome by converting word2 to the same encoding as word1 at the top.
Using a Ptr is usually faster than accessing the MemoryBlock directly. Something like this:
dim p1 as Ptr = w1
dim p2 as Ptr = w2
if p1.Byte(i) = p2.Byte(i) then isequal = isequal + 1
Here is the new method using Kem’s pointers (pardon the pun):
Function WordEqualityInPercent(word1 as string, word2 as String) As integer
// Compares two string and showing as result a percent number how much they are equal
dim result as integer = 0
dim isequal as integer = 0
dim i as integer
dim w1 As MemoryBlock
dim w2 As MemoryBlock
dim w1Size As UInt64
w1 = word1
w2 = word2
dim p1 as Ptr = w1
dim p2 as Ptr = w2
// compare only if strings with same lenghts
if (w1.Size = w2.Size) and (w1.Size > 0) then
// quit if equal
if word1 = word2 then return 100
// iterate through string counting all differences
i = 0
w1Size = w1.Size
while i < w1Size
if p1.Byte(i) = p2.Byte(i) then isequal = isequal + 1
i = i + 1
wend
// create percent number
result = (isequal / w1.Size) * 100
end if
return result
End Function
Getting an 18% performance increase by using pointers instead of memory blocks.
Modifying Alwyn’s code to compare 8 bytes at a time, there can be a significant speed improvement if the two strings are very similar and reasonably large. On the other hand if they are completely different or very small in length, there is only a small performance hit.
[code]Function WordEqualityInPercent(word1 as string, word2 as String) As integer
// Compares two string and showing as result a percent number how much they are equal
dim result as integer = 0
dim isequal as integer = 0
dim i,j as integer
dim w1 As MemoryBlock
dim w2 As MemoryBlock
dim w1Size As UInt64
w1 = word1
w2 = word2
dim p1 as Ptr = w1
dim p2 as Ptr = w2
// compare only if strings with same lenghts
if (w1.Size = w2.Size) and (w1.Size > 0) then
// quit if equal
if word1 = word2 then return 100
// iterate through string counting all differences
i = 0
w1Size = w1.Size
while i < w1Size-7
if p1.int64(i) = p2.int64(i) then
isequal = isequal + 8
else
for j=i to i+7
if p1.Byte(j)=p2.Byte(j) then isequal = isequal + 1
next
end if
i = i + 8
wend
for j=i to w1Size-1
if p1.Byte(j)=p2.Byte(j) then isequal = isequal + 1
next
// create percent number
result = (isequal / w1.Size) * 100
I meant to elaborate that when you have prior knowledge of what the two strings are likely to be like, you can often tailor your compare routines to make them more efficient. If you know that the two strings are likely to be very similar, you can do one thing, and if you know that they are likely to be very different, then you can do something else. Since the question of efficiency has been raised in the first place, I’m assuming that the strings are likely quite long. Otherwise there wouldn’t likely be much point spending a lot of time on optimization.
I struggled with a function like this to compare user inputs to known database values. Initially wrote a function just like these to come up with a percent difference and if it was lower than 3% difference then it would be considered a match, but this was a faulty assumption because of things like:
“Mr. Tom Jones” <> “Mr Tom Jones” <> “Tom Jones”
“Patrick Harris” <> " Pat Harris"
“Noa Jefferson” <> “Noah Jefferson”
When there’s an extra character inserted in there, that completely throws off standard percent comparisons (unless you get clever about it and check for that, but that slows the whole thing down). I ended up modifying the Metaphone algorithm and comparing words based on how they sound phonetically rather than doing direct string comparisons.
Just a suggestion that this may be a better approach when you’re comparing names or words instead of binary data.
Hi Robert, will check your routine when back home (I am on the road right now) but to answer your questions. Why to spent so much time on this: Well, it’s a kind of fuzzy-logic I’ll use in a project. Yes it is likely that only familiar words (to be more precise: only sound-familiar words) will be compared.
You’re pointing in the right direction when talking about Metaphone though I am using “Kölner Phonetic”, a more precise approach for german language. If you like you may check this out here: http://jakobssystems.net/webapps/cgnsearch/cgnsearch.cgi
But even if you pre-select your words phonetically you’ll get a huge amount of results. And to give you hard numbers: German Dictionary has 1.7 Mio words…
Hi Christian, you’re right that Meyer <> Mayer <> Meier - all variations of same name. But this is not the problem anymore when using phonetic search algorithms. If you’re just searching for names this leads to good results. But if you query whole Dictionaries - then you will get a lot of sound-familiar words. Then you need to compare them with each other and based upon the percentage (and other factors) my software chooses one…
you see my WebApp for Kölner Phonetc above… it’s kinda Russel-SoundEx but better for german language. Here you see the results of just same Phonem 1548 … but just one out of 101 is the matching word… guess which one
This is something that I was wondering about too. If the extra character is inserted at the very beginning of the string, then it will result in a very low or even zero percent match, even though the strings might otherwise be identical. To avoid this kind of problem you could use the same kind of logic used in file compression algorithms, but it would be very slow. In any event I think there is an inevitable tradeoff between speed and accuracy.