fuzzy IndexOf, IndexOfBestMatch code help / optimisation

Aaron_Hunt · August 14, 2017, 7:30pm

Some languages have an integer array search option that’s like IndexOf, but will return the closest match when an exact match doesn’t exist.

I’ve made my own IndexOfBestMatch method using a binary tree model, but it seems clunky to me.

Does something like this already exist in Xojo, or MBS plugins, or has anyone else coded their own?

Thanks.

Norman_P · August 14, 2017, 7:57pm

depends on how you define “closest match”

levenshtein distance is decent for a certain kind of similarity
its what we use in the IDE filter, abiove the navigator, to list results in most to least similar order

metaphone is also useful but again it depends on how you define “closest match”

Aaron_Hunt · August 14, 2017, 10:50pm

Sorry, I think we’re at cross purposes … I’m talking about integers, not strings.

Say I have an array of 50 integers n() that does not contain the value 100, but it does contain the value 98. When I say x = n.IndexofBestMatch(100), the function returns the index of value 98 in the array.

My code to do this is a “divide and conquer” binary tree search. I’ll just post it here and see if anyone would like to help make it better (or fix it if it’s wrong or badly done).

It calls one external method to clone an array if necessary, otherwise it’s self-contained.

Public Function IndexOfBestMatch(theArray() as integer, ValueToMatch as integer, arrayIsSorted as Boolean = false, favorHigherValue as Boolean = true, threshold as integer = -1) as integer
  #pragma DisableBackgroundTasks
  #pragma DisableBoundsChecking
  
  // RETURN THE INDEX OF THE CLOSEST VALUE IN THE ARRAY
  ' It is assumed that the elements are unique.
  ' A value 0 <= x >= Ubound is always returned, 
  ' unless a threshold is given, in which case -1 may be returned
  ' only when the best match does not meet the given threshold
  if theArray.Ubound < 1 then return 0
  dim test as integer = theArray.IndexOf( ValueToMatch )
  if test > -1 then return test
  
  ' to do a binary search the array must be sorted
  dim clonedArray(), correctIndexes() as integer
  if not arrayIsSorted then
    ' The array isn't sorted, so we clone it and sort the clone.
    ' We keep an array of the unsorted indexes so that we can
    ' return the correct index, not the index of the sorted clone
    clonedArray() = CloneArray( theArray() )
    for j as integer = 0 to clonedArray.Ubound
      correctIndexes.append j
    next
    clonedArray.sortwith( correctIndexes() )
  else
    ' the array is already sorted, 
    ' so the index to return matches its given order
    clonedArray() = theArray()
    for j as integer = 0 to clonedArray.Ubound
      correctIndexes.append j
    next
  end if
  
  ' quick binary tree search, start in the middle
  dim range as integer = clonedArray.Ubound+1
  dim i as integer = ceil(range/2)
  dim thisDiff, lastDiff, lastIndex as integer
  while i > -1 and i <= clonedArray.Ubound
    thisDiff = ValueToMatch - clonedArray( i )
    ' if the range is 1, then there is nowhere else to go, so exit
    if range = 1 then exit while
    lastDiff = thisDiff
    lastIndex = i
    range = ceil(range/2)
    if thisDiff < 0 then
      i = i - ceil(range/2)
    else
      i = i + ceil(range/2)
    end if
  wend
  
  ' case for arrays of size 1 or 2
  if i > clonedArray.Ubound then return clonedArray.Ubound
  
  if abs( thisDiff ) = abs( lastDiff ) then
    // IT'S A TIE
    ' return the value favored, within bounds
    if threshold = -1 then
      if favorHigherValue then
        if abs( thisDiff ) < abs( lastDiff ) then
          return max( 0 , min( correctIndexes(i), clonedArray.Ubound ) )
        else
          return max( 0 , min( correctIndexes(lastIndex), clonedArray.Ubound ) )
        end if
      else
        if abs( thisDiff ) < abs( lastDiff ) then
          return max( 0 , min( correctIndexes(lastIndex), clonedArray.Ubound ) )
        else
          return max( 0 , min( correctIndexes(i), clonedArray.Ubound ) )
        end if
      end if
    else
      ' best match did not meet threshold
      return -1
    end if
    
  else
    // NO TIE
    ' return the result, within bounds
    if threshold = -1 then
      if abs( thisDiff ) < abs( lastDiff ) then
        return max( 0 , min( correctIndexes(i), clonedArray.Ubound ) )
      else
        return max( 0 , min( correctIndexes(lastIndex), clonedArray.Ubound ) )
      end if
    else
      ' best match did not meet threshold
      return -1
    end if
  end if
End Function

Norman_P · August 14, 2017, 11:01pm

ah you’re right
I’m talking string arrays
I misread your post obviously

Aaron_Hunt · August 17, 2017, 5:20pm

bump

Kem_Tekinay · August 17, 2017, 7:23pm

What if you:

Create an array of correctIndexes and an array of absolute differences to the desired value.
Sort the diffs with the indexes.
Return correctIndexes( 0 ).

Norman_P · August 17, 2017, 7:35pm

if the array is sorted the binary search should be pretty good

where these is no exact match then the items above or below the last index you looked at should be “closest” and if they are within tolerance then you return one

which is what you’ve done

in the unsorted case kems suggestion might prove faster than clone, sort & search

Aaron_Hunt · August 17, 2017, 7:46pm

[quote=346162:@Kem Tekinay]What if you:

Create an array of correctIndexes and an array of absolute differences to the desired value.
Sort the diffs with the indexes.
Return correctIndexes( 0 ).[/quote]

Thanks, Kem. Why didn’t I think of that? So, this is the simplest case, I think (untested) …

[code]dim absDiffs(), correctIndexes() as integer
for j as integer = 0 to theArray.Ubound
correctIndexes.append j
absDiffs.append abs( ValueToMatch - theArray(j) )
next
absDiffs.sortwith( correctIndexes() )

return correctIndexes(0)[/code]

And I would just need to extend it a little with the threshold options I worked into the tree-search method. Let me fire this thing up and do some tests. I’ll report back …

Aaron_Hunt · August 17, 2017, 8:10pm

I need to make more notes to myself. Kem, now I remember when I wrote the tree-search method a few years ago, I did what you suggested first. Now I realise why I didn’t end up using that …

Main problem is that it doesn’t give consistent results in cases where there is a “tie” between values, because sorting an array with duplicates doesn’t always sort in the same order (I think? I think I tested that before, but I need to test it again). There is also no way to work in the favorHigherValue feature of the tree-search method, unless I’m missing something, which I probably am?

Aaron_Hunt · August 17, 2017, 8:26pm

(1) Tests on duplicates sorting using sortwith() shows consistent results (I don’t suppose this behaviour would have changed in the past few years, so either I did something wrong when I tested it a few years ago, or I’m doing something wrong now).

(2) still not sure how to work in a favorHigherValue option …

Christian_Schmitz · August 17, 2017, 9:30pm

MBS Plugin has JaroWinklerDistanceMBS and LevenshteinDistanceMBS functions.
You could try it on all strings in a list and pick the best rating one.

Norman_P · August 17, 2017, 9:47pm

If you read the other posts you see we’ve already covered this
String distances aren’t useful

[quote=346202:@Christian Schmitz]MBS Plugin has JaroWinklerDistanceMBS and LevenshteinDistanceMBS functions.
You could try it on all strings in a list and pick the best rating one.[/quote]

[quote=345521:@Aaron Hunt]Sorry, I think we’re at cross purposes … I’m talking about integers, not strings.
[/quote]

Christian_Schmitz · August 17, 2017, 10:10pm

Sorry. Too late today and go to bed now. Good night.

Kem_Tekinay · August 18, 2017, 3:01am

Here’s an idea. Make the differences array doubles. If the difference is negative, store the absolute value plus 0.5.

Kem_Tekinay · August 18, 2017, 1:22pm

Using this idea, I came up with this. On an array of 2M elements, it takes about 200 ms in 32-bit, and 75 ms in 64-bit (aggressive) looking to match the last element - 1.

Public Function IndexOfBestMatch(arr() As Integer, target As Integer) as Integer
  dim foundIndex as integer = -1
  dim foundDiff as double
  
  for i as integer = 0 to arr.Ubound
    dim v as integer = arr( i )
    if v = target then
      foundIndex = i
      exit
    end if
    
    dim diff as double = v - target
    if diff < 0.0 then
      diff = ( 0.0 - diff ) + 0.5
    end if
    
    if i = 0 or diff < foundDiff then
      foundIndex = i
      foundDiff = diff
    end if
  next
  
  return foundIndex
End Function

Aaron_Hunt · August 18, 2017, 5:27pm

Thanks, Kem. Looks pretty good. I’ve optimised a weak spot in my original code, and now I’m working on a speed comparison of these few different approaches. Will post my findings here.

Aaron_Hunt · August 18, 2017, 6:24pm

Let’s call the 3 methods above “binary tree” (me), “double sort” (Kem), and “single loop” (Kem).

The results vary according to the size of the array we’re searching in. I tried each with arrays of 10,100, 1K, 10K, 100K, and 1M.

Kem’s “single loop” was fastest for arrays of 10, 100, 1K, and 1M.

Kem’s “double sort” was fastest for arrays of 10K and 100K. Why this should be is a mystery to me, but it’s consistent, repeatable.

Here are a few raw numbers on tests for an array of 1M values. In order to ensure duplicate values (to be sure to be able to search for values that don’t exist) the array is created with random numbers limited to 120% the size of the array.

to find the best match to 7794 in an array of 1M values where 7794 doesn’t exist:
binary tree: 300 ms
double sort: 266 ms
single loop: 194 ms
to find the best match to 114314 in an array of 1M values where 114314 doesn’t exist:
binary tree: 306 ms
double sort: 282 ms
single loop: 202 ms
to find the best match to 1199986 in an array of 1M values where 1199986 doesn’t exist:
binary tree: 303 ms
double sort: 290 ms
single loop: 281 ms

So, for very large arrays, the advantage of the single loop diminishes somewhat as the best match to the target value has higher indexes in the array.

My “binary tree” was never faster than the other two, but clocked in consistently regardless of the searched-for value, which suggests that it could be optimised for faster consistent results, but maybe not.

Kem_Tekinay · August 18, 2017, 6:27pm

You tested in a compiled app?

Aaron_Hunt · August 18, 2017, 7:28pm

Good point. That was in the IDE.