Searching for similarity

Apparently, Metaphone’s inventor Lawrence Philips came out with improved versions of Metaphone called Double Metaphone and Metaphone 3. When I was doing a lot of Filemaker work back in the late 1990’s and early 2000’s, I coded the original Metaphone algorithm and then added my own tweaks when I found problems such as KN and GHT. FWI, in my modified algorithm NITE and KNIGHT both produce the same code: MD.
I’ve added the ability to distinguish between letter combinations that occur at the beginning, middle or end of a word. So, KN at the beginning will convert to N, but not in the middle. Eventually, I’d made so many changes that I’ve changed the name of the function to Neophone.

Another troublesome name is Christian and its many variants: Christeen, Christen, Christena, Christene, Christian, Christiana, Christiane, Christin, Christina, Christine, Christinia, Cristen, Cristin, Cristina, Cristine, Kerstin, Kiersten, Kirsten, Kirstin, Kristan, Kristeen, Kristen, Kristian, Kristin, Kristina, Kristine, Kristyn, Krysten, Krystin, Krystina, Krystyna. These all encode to KRSDM in the modified algorithm.

EAUX is one that I hadn’t considered. I guess I’ll have to add it.

Meanwhile, the code for Neophone is:

Function Neophone(s As String) as String
  //Neophone phonetic index key generator
  //Robert Weaver, Saskatoon Canada, 2017
  //
  //Neophone is a Modification of Lawrence Philips' Metaphone phonetic index key generator.
  //The modifications are to address some of the shortcomings of the original Metaphone
  //algorithm. In particular, it addresses situations where the pronunciation of certain
  //letters or letter combinations change significantly when they precede or follow other
  //certain letters or letter combinations. The algorithm is based on English pronunciation.
  //
  '
  'Condenses input word into a series of letters representing phonetic groups A,B,D,K,L,M,R,S
  'A = any leading vowel (all other vowels are deleted)
  'B = B,F,P,V
  'D = D,T
  'K = C,G,K,Q
  'KS = X
  'L = L
  'M = M,N
  'R = R
  'S = c,g,j,S,X,Z (Lowercase refers to soft pronunciation of the letter)
  'Uses ReplaceAllB to ensure case sensitive search. Therefore, input string must not contain diacriticals.
  'Word is delimited with leading and trailing "." in order to facilitate searching for 
  'start of word and end of word characters.
  'Converted from FileMaker computed field, which used multi-argument Substitute function which is case sensitive.
  'Therefore, the use of lowercase in the replace string prevents further matches.
  'So, text search/replace pairs such as .TIA:.tia followed by TIA:SHA are used to prevent replacement if found
  'at the beginning of a word, but doesn't prevent replacement if found elsewhere.
  dim i As Integer
  dim sIn As String = "."+Trim(Uppercase(s))+"."
  efOut.AppendText sIn+EndOfLine
  'Step 1: Process specially pronounced letter combinations
  static searchStr1() As String =  Split(".X,.KN,.PN,.GN,.TIA,.TIO,TCH,WR,WH,MB.,GHT,GH.,GH,CHR,CR,SCH,STIA,STIO,TIA,TIO,CH,CE,CI,CY,CK,DGE,DGI,DGY,GE,GI,GY,PH,Q,V,X",",")
  static replaceStr1() As String = Split(".S,.N,.N,.N,.tia,.tio,CH,R,W,M.,T,.,G,CR,KR,SK,STA,STO,SHA,SHO,SH,SI,SI,SI,K,JE,JI,JI,JE,JI,JI,F,K,F,KS",",")
  Static n1 As Integer = UBound(searchStr1)
  for i=0 to n1
    sIn=ReplaceAllB(sIn,searchStr1(i),replaceStr1(i))
  next
  efOut.AppendText sIn+EndOfLine
  sIn=Uppercase(sIn)
  'Step 2: Convert vowels and silent letters
  Static searchStr2() As String = Split("E,I,O,U,WA,WY,YA",",")
  static replaceStr2() As String = Split("A,A,A,A,wA,wY,yA",",")
  Static n2 As Integer = UBound(searchStr2)
  for i=0 to n2
    sIn=ReplaceAllB(sIn,searchStr2(i),replaceStr2(i))
  next
  sIn=Uppercase(sIn)
  'Step 3: Combine similar sounding consonants
  Static searchStr3() As String = Split("G,C,TH,J,Z,F,P,V,H,N,T,Y,W",",")
  static replaceStr3() As String = Split("K,K,S,S,S,B,B,B,A,M,D,,",",")
  Static n3 As Integer = UBound(searchStr3)
  for i=0 to n3
    sIn=ReplaceAllB(sIn,searchStr3(i),replaceStr3(i))
  next
  'Step 4: Remove adjacent duplicates
  dim ch As String = midB(sIn,1,1)
  dim sTemp As String = ch
  For i=2 to Len(sIn)
    if ch <> midB(sIn,i,1) then
      ch=midB(sIn,i,1)
      sTemp=sTemp+ch
    end if
  Next
  'Step 5: Delete vowel markers and delimiters
  sIn=ReplaceAllB(sTemp,".A",".0")
  sIn=ReplaceAllB(sIn,"A","")
  sIn=ReplaceAllB(sIn,".","")
  sIn=ReplaceAllB(sIn,"0","A")
  Return sIn
End Function

@Robert Weaver : efOut is undefined ?

Sorry, the 3 lines that have efout in them should be deleted. efout was an edit field that I was using to monitor the operation during debugging. I didn’t realize those lines were still there.

Thank you all for your replies.

I deliberately avoided being specific about the application
I have in mind and about the SQL implementation in use (I
generally try to avoid implementation-specific features for
future portability).

The database involved contains (today) nearly 14,000 records
and is growing at the rate of some 1,000 records a month.
Each record has 42 fields, most of which are numeric, but it
also has text fields (company name, locality, state/province,
country, short description, etc).

The database is on a local server running MariaDB and on a
remote server running MySQL.

What complicates matters is that the texts are in just about
every European language (including Basque and Maltese) and in
Romanized versions of languages such as Russian and Chinese.
This would, I think, make any sort of lookup table far too
cumbersome to be of practical use.

Data entry is quite straightforward. The difficulty comes when
searching for entries, which inspired my posting.

I’ll investigate your suggestions and see what happpens!

Try asking the question on Stackoverflow.com, and make it not specific to Xojo, by just specific to MySQL. Maybe there are extensions for it or other tricks that can help you.