Searching for similarity

Peter_Rodwell · August 22, 2018, 5:37pm

Does anyone know of a way to implement an SQL “similar” function?

What I’m looking for is an improvement on the LIKE function. For
example, a statement such as

SELECT * FROM my_dbase WHERE name LIKE '%Smith%"

would return those records containing “Smith”, “Smith & Son”, “Smithson”, etc,
but not “Smyth”. I’d like a statement such as

SELECT * FROM my_dbase WHERE name SIMILAR TO '%Smith%"

such that “Smyth” would also be returned.

Of course, it would be necessary to define “similar”, perhaps with
something like

SELECT * FROM my_dbase WHERE name 50% SIMILAR TO '%Smith%"

Does anyone have any suggestions as to how this could be implemented?
A brute force method with letter-by-letter comparisons seems the
obvious way but this would probably be too slow for databases with
many thousands of records.

Kem_Tekinay · August 22, 2018, 5:39pm

What about LIKE ‘%Sm_th%’?

DaveS · August 22, 2018, 5:43pm

There are a number of algorithms for this kind of thing… Soundex is one, and I don’t recall the exact name, but Levenstien or something “similar”

AlbertoD · August 22, 2018, 5:44pm

My guess is that Smyth was just an example.
I guess SIMILAR will also report:

Smitx
Snith
Smoth
and others

Kem_Tekinay · August 22, 2018, 5:46pm

We are also assuming this is specific to SQLite, right?

DaveS · August 22, 2018, 5:47pm

if so, then LIKE is about the only choice… and that won’t find things that sound the same

Beatrix_Willius · August 22, 2018, 6:01pm

You need to define “similar” with a bit more detail. The algorithm that Dave mentioned is this one: https://en.wikipedia.org/wiki/Levenshtein_distance

Markus_Winter · August 22, 2018, 6:18pm

The MBS SQLite extensions have SoundEx

See https://www.mbsplugins.de/archive/2010-03-09/Release_of_our_SQLite_Extensio and https://www.mbsplugins.de/archive/2015-09-24/Embedded_SQLite_and_encryption

DaveS · August 22, 2018, 6:22pm

for the record… Soundex is crude, and not too accurate… is reduces any word to a letter and 3 numbers… for example my name if I remember boils down to S450 … as to thousands of words that sound nothing like S

José_María_Terry_Jiménez · August 22, 2018, 6:22pm

Search for soundex and your database engine. Many implement it in any or other way.

https://en.wikipedia.org/wiki/Soundex

DaveS · August 22, 2018, 6:27pm

Don’t waste your time on Soundex… trust me, the results are totally inaccurate, and you will just be wasting your time… the Soundex Algorithm is super simplistic…

Markus_Winter · August 22, 2018, 6:35pm

Dave, basically you are saying use like or forget about it. Come on, Im sure you can come up with something more helpful.

In the meanwhile this might be a worthwhile read: Using SOUNDEX and DIFFERENCE to Standardize Data in SQL Server

DaveS · August 22, 2018, 6:38pm

not what I am saying at all… what I am saying is don’t waste your time with Soundex…

I know exactly how it works, and what its pitfalls are. The Levenshtein_distance is way more accurate (and more complex), but even it is not “perfect”…

Likes will only find things that have a similar pattern to the characters, but has nothing to do with what it sounds like…

Thomas_Tempelmann · August 22, 2018, 6:40pm

Several techniques come to mind:

Replace common typos and similarities

a)

This works if you can predict all the possible things you want to see as similar, e.g. “y” = “i”, and “ou” = “uo” = “oo”, and so on.
In this case, you can create a new table or add a column to your existing table, in which you put the simplified names, by taking each original name and replace all the possible similarities with one common replacement, i.e. replace any “y” with “i”, and any “uo” and “ou” with “oo”. Then write those simplified names into the new column. Now, when you want to search, you’d simplify your search string with the same replacement function, and then search the new column using the LIKE “%searchstring%” pattern.

This method would be fairly fast, i.e. as fast as your current LIKE … method.

b)

Even faster would be to use Sqlite’s FTS (google it) on this simplified column. That way, using the FTS search, you can, by looking for “Smith”, find “Smithson” and “Smith & sons”, but not “Nosmith” (that’s becausr its lookup goes by word’s beginnings, but is very very fast).

Use an extension to compare the strings

You could also dynamically add a comparison function to the sqlite library. I’ve documented this here: http://blog.tempel.org/2015/09/xojosqliteextensions.html (look for custom function). This is much harder to do, though, and you’ll still have to code the method that will decide what’s similar. Here, the Levenshtein Distance function would perhaps make sense to be used.

DaveS · August 22, 2018, 6:47pm

That all assumes similar spelling and has nothing to do with “sound”
“PHISH” and “FISH” sound the same… but most algoritms won’t say so, and LIKES and SOUNDEX will both fail (unless you do *ISH)

Thomas_Tempelmann · August 22, 2018, 6:52pm

@Dave S Did you refer to my post above yours? Because my proposed first method would also replace “ph” with “f”, making this a “sound” algorithm Of course, as I wrote, this only works if you can predict all those similarities so that you can hard-code them into the replacement function. Maybe the web has some tables for that, making this easier to build a list of them?

Paul_Lefebvre · August 22, 2018, 7:25pm

As others have mentioned, SoundEx is one option. Here’s a Xojo implementation you can use with SQLite:

http://developer.xojo.com/userguide/soundex

Another option might be to consider SQLite FTS5 for full text searching as it might match better depending on what you’re doing.

http://developer.xojo.com/userguide/sqlite-full-text-search

Paul_Finer · August 22, 2018, 7:55pm

Joe Strout has a String Utils module that includes a function to calculate Levenshtein distance.

http://www.strout.net/files/rb/stringutils.zip

Robert_Weaver · August 22, 2018, 8:06pm

It’s important to know whether you are searching for a single word, or trying to find similarity between samples of multi word text. Depending on this, the viable solutions are significantly different.

There are variants of Soundex, such as Metaphone that are significant improvements. However these are really only suitable for matching of single words, not multi word text samples. However, they are very fast in database applications. Soundex/Metaphone can be calculated once, at the time that the data is entered into the database, and used as an index field. Then, for a search, the Soundex/Metaphone value is calculated only on the search text and then used to find the index value.

Levenshtein Distance is suitable for matching multi word text, but it’s expensive because it has to be calculated for every combination of search text and database text, so you can’t pre-index the database fields with a Levenshtein Distance value without knowing the search text in advance. And, of course, it would be different for every different search text. So, it ends up being a record by record calculation.

John_A_Knight_Jr · August 22, 2018, 10:23pm

[quote=402070:@Dave S]That all assumes similar spelling and has nothing to do with “sound”
“PHISH” and “FISH” sound the same… but most algoritms won’t say so, and LIKES and SOUNDEX will both fail (unless you do *ISH)[/quote]
My name never shows up properly in a Soundex search.
I remember adding a few tweaks before applying standard Soundex to cover a few anomalies like that:
Starts with KN, replace with N (not many silent 'K’s in the middle of words)
Starts with PH, replace with F (PH in the middle becomes 1, ‘bfpv’, so Stephen and Steven match)
Starts with PS , replace with S
EAUX anywhere, replace with a vowel (O)
This was 1988, pre Metaphone, but worked for our data.
Hmm, Metaphone doesn’t account for ‘EAUX’