Fuzzy search

Hi,

in the past, the elfData plug-in enabled fuzzy search where un-perfect matches can be found.
Does anyone know whether there exist the equivalent for Xojo somewhere ?

many thanks,
best,
Franck

I’m going to look into Elastic Search in more detail soon. Yes, it’s Java. But it’s very possible to talk to Java applications with Xojo.

http://en.wikipedia.org/wiki/Soundex
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

http://www.mbsplugins.eu/StringSoundEx.shtml

Found this for RealBasic but could not figure it out
http://mythtech.net/name.html

VB code
The “Soundex” Function in Visual Basic
http://www.source-code.biz/snippets/vbasic/4.htm

SoundEx function
http://www.freevbcode.com/ShowCode.asp?ID=1136

Both SoundEx and Metaphone are part of the StringUtils module. (I don’t have a link handy.)

Metaphone is in my M_String module, found here:

http://www.mactechnologies.com/downloads

http://www.strout.net/info/coding/rb/intro.html
+
http://www.additional.com/community/realbasic/stringutilsmethods.html

Literature :
http://great-white-software.com/rblibrary/index.php?main_page=product_info&products_id=109

Many thanks for these tracks.
I however was looking for a way to detect imperfect matches in DNA sequences and thus the Soundex and Metaphone cannot be used.

@Beatrix
I would love to know more about how to use java routines in Xojo. I would then have access to the load of tools available in repository like BioJava. Do you use the MBS plug-in ?

Levenstein distance is a decent fuzzy search - and you can get a measure of how close something matches

http://en.wikipedia.org/wiki/Levenshtein_distance

And there are several others listed on http://en.wikipedia.org/wiki/Fuzzy_string_searching

@Franck: yes, I use the MBS plugin. I’ve used Jave before for talking to the iText pdf library (http://itextpdf.com/).

The easiest way to talk to Java is if the application is a console app. It’s just required to call the app with the parameters. MBS needs a bit more effort but there it’s possible to call methods directly.

Levenstein and even more so metaphone are easy and simple. I’ll be going to use Elastic Search to speed up searching text in a database. There aren’t that many solutions around for that.

Thanks.
I have coded a smith-waterman algo in Xojo and it works fine but I will compare its sped with Levenstein (although smith-waterman does not only give a distance but also find the best match). I was hoping to get access to something faster than Xojo.

[quote=124982:@Franck Perez]Thanks.
I have coded a smith-waterman algo in Xojo and it works fine but I will compare its spêed with Levenstein (although smith-waterman does not only give a distance but also find the best match). I was hoping to get access to something faster than Xojo.[/quote]

Sounds interesting Franck, are you going to make the method available to others as open source? Maybe with others looking at it as well some performance tweaks could be found.

Well. Yes, I would be happy to post it and share it with the community (and may be benefit from optimisation indeed !) but being a self-trained developer, my code is not always very nice or well commented…
I’ll go back on my code an see if it is not too bad to become public…

Nah, don’t worry about that! Peer reviewed code is a fantastic way to learn and get better code in the process, for free!

[quote=125050:@Franck Perez]Well. Yes, I would be happy to post it and share it with the community (and may be benefit from optimisation indeed !) but being a self-trained developer, my code is not always very nice or well commented…
I’ll go back on my code an see if it is not too bad to become public…[/quote]

Do not be shy. We are not doing a concours d’élégance and a lot of use are self-trained. This forum is precisely the opportunity to share code for the benefit of all. And sometimes, you will see others go out of their way to improve your code if it is possible, just for the pleasure to help.

I have adapted to Xojo the VB Soundex example at http://www.freevbcode.com/ShowCode.asp?ID=1136

Upon call, it takes the word to compare as string, and the length of the soundex as optional second argument. If nothing is set, it defaults to 4. Max length of the result is 10.

It can be used to compare strings by sound. For instance if Soundex("Soup") = Soundex("Soap") then

[code]Function SoundEx(WordString as string, optional SoundexLen as integer) As String
if SoundexLen = 0 then SoundexLen = 4
Dim Counter As Integer
Dim CurrChar As String
Dim Soundex as string

If SoundExLen > 10 Then
SoundExLen = 10
ElseIf SoundExLen < 4 Then
SoundExLen = 4
End If
SoundExLen = SoundExLen - 1

WordString = Uppercase(WordString)

For Counter = 1 To Len(WordString)
If Asc(Mid(WordString, Counter, 1)) < 65 Or Asc(Mid(WordString, Counter, 1)) > 90 Then
WordString = left(WordString,counter-1)+" "+right(WordString,len(WordString)-counter)
End If
Next Counter
WordString = Trim(WordString)

SoundEx = WordString

SoundEx = ReplaceAll(SoundEx, “A”, “0”)
SoundEx = ReplaceAll(SoundEx, “E”, “0”)
SoundEx = ReplaceAll(SoundEx, “I”, “0”)
SoundEx = ReplaceAll(SoundEx, “O”, “0”)
SoundEx = ReplaceAll(SoundEx, “U”, “0”)
SoundEx = ReplaceAll(SoundEx, “Y”, “0”)
SoundEx = ReplaceAll(SoundEx, “H”, “0”)
SoundEx = ReplaceAll(SoundEx, “W”, “0”)
SoundEx = ReplaceAll(SoundEx, “B”, “1”)
SoundEx = ReplaceAll(SoundEx, “P”, “1”)
SoundEx = ReplaceAll(SoundEx, “F”, “1”)
SoundEx = ReplaceAll(SoundEx, “V”, “1”)
SoundEx = ReplaceAll(SoundEx, “C”, “2”)
SoundEx = ReplaceAll(SoundEx, “S”, “2”)
SoundEx = ReplaceAll(SoundEx, “G”, “2”)
SoundEx = ReplaceAll(SoundEx, “J”, “2”)
SoundEx = ReplaceAll(SoundEx, “K”, “2”)
SoundEx = ReplaceAll(SoundEx, “Q”, “2”)
SoundEx = ReplaceAll(SoundEx, “X”, “2”)
SoundEx = ReplaceAll(SoundEx, “Z”, “2”)
SoundEx = ReplaceAll(SoundEx, “D”, “3”)
SoundEx = ReplaceAll(SoundEx, “T”, “3”)
SoundEx = ReplaceAll(SoundEx, “L”, “4”)
SoundEx = ReplaceAll(SoundEx, “M”, “5”)
SoundEx = ReplaceAll(SoundEx, “N”, “5”)
SoundEx = ReplaceAll(SoundEx, “R”, “6”)

CurrChar = Left(SoundEx, 1)
For Counter = 2 To Len(SoundEx)
If Mid(SoundEx, Counter, 1) = CurrChar Then
SoundEx = left(SoundEx,counter-1)+" "+right(SoundEx,len(SoundEx)-counter)
Else
CurrChar = Mid(SoundEx, Counter, 1)
End If
Next Counter
SoundEx = ReplaceAll(SoundEx, " ", “”)

SoundEx = Mid(SoundEx, 2)
SoundEx = ReplaceAll(SoundEx, “0”, “”)

dim lo as integer = SoundexLen
dim stringue as string
for i as integer = 1 to lo
if lo = 0 then exit
stringue = stringue+“0”
next
SoundEx = SoundEx + stringue
SoundEx = Left(WordString, 1) + Left(SoundEx, SoundExLen)
return Soundex
End Function
[/code]

Here is the Levenshtein Distance function. Compares two strings and gives a “distance” integer according to the similarity. Identical strings generate zero, and the more different the strings, the higher the number.

http://en.wikipedia.org/wiki/Levenshtein_distance

[code]Function LevenshteinDistance(s as string, t as string) As integer
Dim n As Integer = len(s)
Dim m As Integer = len(t)
Dim d(100,100) As Integer

If n = 0 Then
Return m
End If

If m = 0 Then
Return n
End If

Dim i As Integer
Dim j As Integer

For i = 0 To n
d(i, 0) = i
Next

For j = 0 To m
d(0, j) = j
Next

For i = 1 To n
For j = 1 To m

  Dim cost As Integer
  If mid(t,j,1) = mid(s,i,1) Then
    cost = 0
  Else
    cost = 1
  End If
  
  d(i, j) = MathMin(MathMin(d(i - 1, j) + 1, d(i, j - 1) + 1),d(i - 1, j - 1) + cost)
Next

Next

Return d(n, m)

End Function
[/code]

This method uses another to mimic .Net Math.Min, which returns the lower of the two terms :

Function MathMin(n as integer, n1 as integer) As integer if n < n1 then return n else return n1 end if End Function

Or just use Min

Why did I not think about that ? Thank you.

[code]Function LevenshteinDistance(s as string, t as string) As integer
Dim n As Integer = len(s)
Dim m As Integer = len(t)
Dim d(100,100) As Integer

If n = 0 Then
Return m
End If

If m = 0 Then
Return n
End If

Dim i As Integer
Dim j As Integer

For i = 0 To n
d(i, 0) = i
Next

For j = 0 To m
d(0, j) = j
Next

For i = 1 To n
For j = 1 To m

  Dim cost As Integer
  If mid(t,j,1) = mid(s,i,1) Then
    cost = 0
  Else
    cost = 1
  End If
  
  d(i, j) = Min(Min(d(i - 1, j) + 1, d(i, j - 1) + 1),d(i - 1, j - 1) + cost)
Next

Next

Return d(n, m)

End Function
[/code]

Thanks for the examples Michel.
For those interested, I have posted the Smith-Waterman module in this new thread .
I could not post the complete routine (too long) so a link to a smith-waterman alignement example application is provided.

[quote=125416:@Franck Perez]Thanks for the examples Michel.
For those interested, I have posted the Smith-Waterman module in this new thread .
I could not post the complete routine (too long) so a link to a smith-waterman alignement example application is provided.[/quote]

The more we share, the richer this forum gets :slight_smile: Thank you.