What strategy (or strategies) to avoid duplicates?

A client asked about a “strategy to avoid duplicate Records” for a persons database.

Besides checking for a unique “First Name / Family Name / Birth Date” combination (if all three are the same *), what can we do (programmatically speaking)?

  • At entry time, a check against the database contents. But I am not sure this is enough: two Records may have identical values in these three Fields, yet differ in other Fields :frowning:

Until now, I have done that manually, sorting the columns and reading through the whole DB contents.

Fun: imagine a field with “Stressed” and another with “Desserts”. Same number of characters, same characters, the only difference is the reading order (left to right / right to left)…
Another one: Martine / Martien (French first names): same length, same characters, just two transposed letters. Likewise “the” / “teh”, etc.

If you have control of the database structure:

ALTER TABLE YourTable
  ADD CONSTRAINT YourTable_unique UNIQUE (B, C, D);

Now the database won’t allow you to enter a duplicate… You may want to add a NOCASE collation if you want upper case to match lower case.
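For instance, with SQLite (shown here through Python’s sqlite3 module; the persons table and its columns are hypothetical), COLLATE NOCASE on the columns makes the unique index case-insensitive:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE persons (
    first_name TEXT COLLATE NOCASE,
    family_name TEXT COLLATE NOCASE,
    birth_date TEXT)""")
# SQLite adds table constraints via a unique index
conn.execute("""CREATE UNIQUE INDEX persons_unique
    ON persons (first_name, family_name, birth_date)""")

conn.execute("INSERT INTO persons VALUES ('Martine', 'Dupont', '1950-07-12')")
try:
    # same person, different capitalization: rejected thanks to NOCASE
    conn.execute("INSERT INTO persons VALUES ('MARTINE', 'dupont', '1950-07-12')")
except sqlite3.IntegrityError as e:
    print("Duplicate rejected:", e)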

It can be arbitrarily complicated. It depends on how many fields you have to deal with (First, Last, Middle, Phone, Birthday, SS#, etc.).

The more fields you have, the more sophisticated a strategy can be. It also depends on how many people are in the database.

If you are dealing with large databases (in the 100,000-record range) and there is manual entry, then there will be many, many duplicates. It is easy to alert when the names are the same and the data entry is perfect. But the issue in real life is that people’s names “change” (marriage, divorce, nicknames, etc.), and data entry is very imperfect when done by humans.

So consider doing lots of different evaluations to look for duplicates. Basically you are trying to create a metric for “similarity”. Robert and Bob are more similar than Robert and David. Social Security numbers that are off by one or two digits can be assigned a value for similarity (the last four digits being much more important than the others). Birthdays that are “off” by one month or day are “similar”. Transpositions are common with manual entry (7/12/50 is similar to 7/21/50), etc. In my own work I developed a “score” to estimate how “similar” two entries were.

Then I could manually look at all cases with high scores.
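As an illustration only (the field names, weights, and nickname table here are hypothetical, not the actual code described above), such a score might be a weighted sum of per-field comparisons:

NICKNAMES = {"bob": "robert", "bill": "william", "kathy": "katherine"}

def canonical(first: str) -> str:
    """Map common nicknames to a canonical form (tiny illustrative table)."""
    first = first.lower()
    return NICKNAMES.get(first, first)

def similarity_score(a: dict, b: dict) -> int:
    """Weighted similarity between two person records (weights are guesses)."""
    score = 0
    if canonical(a["first"]) == canonical(b["first"]):
        score += 3                               # Robert vs Bob still matches
    if a["last"].lower() == b["last"].lower():
        score += 4
    if a["ssn"][-4:] == b["ssn"][-4:]:
        score += 5                               # last four SSN digits matter most
    if a["birth"] == b["birth"]:
        score += 3
    elif sorted(a["birth"]) == sorted(b["birth"]):
        score += 2                               # transposition: 7/12/50 vs 7/21/50
    return score

rec1 = {"first": "Robert", "last": "Smith", "ssn": "123-45-6789", "birth": "7/12/50"}
rec2 = {"first": "Bob",    "last": "Smith", "ssn": "123-45-6788", "birth": "7/21/50"}
print(similarity_score(rec1, rec2))              # 9: high enough to review by hand

Pairs scoring above some threshold then go on the list for manual review.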

There are algorithms to evaluate how “similar” two strings are (for example, Levenshtein distance, which you can look up on Wikipedia), but I found custom code tailored to the population at hand more successful and faster. There are also algorithms that evaluate how similar names are in terms of sound (Kathy/Cathy, Jefferson/Jeffersen, etc.).
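For reference, a minimal dynamic-programming implementation of Levenshtein distance (nothing population-specific about it, which is exactly its limitation):

def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))        # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                        # distance from prefix of s to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Martine", "Martien"))  # 2: the transposed pair costs two edits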

If you have 100,000 people, comparing every entry against every other entry is very, perhaps prohibitively, time consuming (about n(n−1)/2, roughly 5 billion comparisons for n = 100,000). I would tend to sort the people by many different criteria and then only compare adjacent entries, or entries within the group matching on that criterion. So I might have ten ways of sorting the database, and then compare adjacent entries and give them a similarity score.

For example, sort by the first four letters of the last name concatenated with the last five digits of the phone number, or by the birthday plus the last four digits of the Social Security number, etc. Then assign a score to the “similarity” of adjacent people, create a list of these pairs, and ultimately sort the list by score. The true duplicates will tend to float to the top.
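A minimal sketch of that sort-and-compare-adjacent idea (the record layout and key functions are illustrative; similarity_score stands for the kind of function sketched above):

import re

def key_name_phone(rec: dict) -> str:
    """First 4 letters of the last name + last 5 digits of the phone number."""
    digits = re.sub(r"\D", "", rec["phone"])
    return rec["last"][:4].lower() + digits[-5:]

def key_birth_ssn(rec: dict) -> str:
    """Birthday + last 4 digits of the Social Security number."""
    return rec["birth"] + re.sub(r"\D", "", rec["ssn"])[-4:]

def candidate_pairs(records: list, key_funcs: list) -> set:
    """Sort by each key and pair up only adjacent records: roughly
    O(k * n log n) work instead of O(n^2) full pairwise comparison."""
    pairs = set()
    for key in key_funcs:
        ordered = sorted(records, key=key)
        for r1, r2 in zip(ordered, ordered[1:]):
            pairs.add(tuple(sorted((r1["id"], r2["id"]))))
    return pairs

Each surviving pair then gets a similarity score, and the final list is sorted by that score so the true duplicates float to the top.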

The population matters. Older white Americans, Frenchmen, etc. will each call for different name-based algorithms. Two entries with the last name Smith are less likely to be duplicates than two entries with the last name Quismodo. A match between a middle name and a last name, or between a first name and a middle name, is given “some” weight, but not as much as a match between two last names.

Once you start finding the duplicates in the database, you can start refining the scores that you assign to various degrees of similarity.

Finding exact matches is not a realistic description of the task at hand. You build up a fairly complex bit of code to deal with this situation. Just keep refining your code as you start seeing the “results”.

I agree with (and have done) most of what Robert said. But he was focused more on identifying existing duplicates. I assume you want to check for possible duplicates before adding a new record. To make the process quicker, I stored “search ready” copies of all the fields of interest (for a name field, for instance, removing everything but alphabetic characters). When a user started entering the various fields into a new-record form, I would run a procedure that came up with possible matches ordered by score, as Robert mentioned (rerunning the procedure as each field was entered or modified). If any scored above a certain threshold, I would present those records to the user and ask if any of them were the same person they were trying to add. Humans are much better at determining the true matches than any query, especially if you have several pieces of data, each of which could be slightly different from the proposed match.

You can even allow the user to set the match threshold value, so they can adjust how close a match they want presented.
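A sketch of the “search ready” normalization and the threshold filter (the field handling and the score function are assumptions, along the lines sketched above):

import re
import unicodedata

def search_ready(value: str) -> str:
    """Lower-case, fold accents (é -> e), and keep only a-z."""
    folded = unicodedata.normalize("NFKD", value.lower())
    return re.sub(r"[^a-z]", "", folded)

def possible_matches(new_rec: dict, existing: list, score_fn, threshold: int) -> list:
    """Score the partially entered record against every stored record and
    return the candidates at or above the (user-adjustable) threshold."""
    candidates = [(score_fn(new_rec, rec), rec) for rec in existing]
    candidates = [c for c in candidates if c[0] >= threshold]
    return sorted(candidates, key=lambda c: c[0], reverse=True)

The normalized copies would be stored alongside the real fields, so each keystroke only triggers cheap comparisons rather than re-normalizing the whole table.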

You might want to check out some phonetic indexing algorithms such as “Soundex” and “Metaphone”, which convert similar-sounding but differently spelled words into the same key value, which can then be used to search for duplicates. Years ago I wrote a variation of the Metaphone algorithm for Filemaker. I probably still have it around somewhere.
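As a reference point, classic American Soundex is short enough to sketch inline (Metaphone is considerably more involved):

def soundex(name: str) -> str:
    """Classic American Soundex: first letter + three digits."""
    codes = {c: str(i) for i, grp in enumerate(
        ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1) for c in grp}
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = codes.get(letters[0])
    for c in letters[1:]:
        if c in "hw":
            continue                 # h/w do not separate equal codes
        code = codes.get(c)          # vowels have no code and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

print(soundex("Jefferson"), soundex("Jeffersen"))  # J162 J162 -> same key

Note that Soundex keeps the first letter, so Kathy (K300) and Cathy (C300) still get different keys; that is one reason Metaphone-style algorithms often do better on real names.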

Thank you all.

During the night (it’s the morning in France now), my brain worked a bit and here I am:

a. Apparently, doctors “sort” people by First Name / Family Name / Birth Date. This can be used at RecordSet writing time.

b. Based on what I have already done (different features), I can create a check method that builds a list of suspected potential duplicates and lets the user deal with them. I can also give him/her a “white paper” of techniques for hunting duplicates using the Listbox sort buttons (of course, depending on the Field contents; here I have First Name / Family Name / Birth Date / Address / Zip / City / Phone #, etc.).

c. For the dates, it is impossible to avoid all errors, but I can narrow the error window by using either a Calendar entry or three PopupMenus…

Lastly, if this is not enough, I have ideas (read above) to explore at the customer’s cost.

In a very different project, duplicates are detected on a date column: the same date twice is impossible (in that context), and I have two RadioButtons: Report (potential errors) or Delete (for the bold). I do that because two Records can be incomplete without being identical; reporting the duplicate allows the user to complete one Record and delete the now-useless Row.

PS: I love the “similarity” concept.

When considering similarity between two text samples there are two distinct classes of functions which can make a huge difference in calculation complexity when dealing with databases:

  1. Functions which directly calculate the similarity of two text samples. For input parameters, they take the two text samples, and calculate a value corresponding to how similar they are.
  2. Functions which classify a single text sample in a certain way (e.g., phonetically) and generate a key value. The function is applied separately to each text sample, and the output key values are then compared to determine if the text samples are similar.

The first type includes Levenshtein distance and Q-gram type functions. The second type includes phonetic indexing functions such as Soundex and Metaphone. The first type may appear simpler at first, but it requires a lot more processing. If you have a database with n records and you want to find existing duplicates, you have to calculate a type 1 function roughly n^2 times, in order to compare every record with every other record. When using type 2 functions, you only have to calculate the function n times (once per record), and then sort by the output key value. Similar records will be grouped together, and you can step through the records comparing only adjacent ones.

In the case where a new record is being created and you want to find whether it already exists, you must calculate the type 1 function n times (once for every existing record) to see if any of them match the new record. You only have to calculate the type 2 function once, to get the key value for the new record, because the type 2 key values have already been calculated for all of the existing records. So you merely have to search for the key value of the new record to see if a duplicate exists. This makes type 2 functions much more computationally efficient. The drawback is that type 2 functions don’t work as well: depending on their algorithms, they may produce many false positives or many false negatives.

My own preference would be to go with a type 2 function coded in such a way that it’s more likely to generate false positives than false negatives. Then, if the number of positive matches is excessively high, you could apply a type 1 function to the set of matches returned by the initial type 2 pass to reduce the number of records returned.
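Combining the earlier sketches (soundex as the type 2 key, levenshtein as the type 1 refinement, both defined above), that hybrid might look like this:

from collections import defaultdict

def find_candidates(records: list, new_last: str,
                    max_results: int = 10, max_distance: int = 3) -> list:
    """Type 2 pass: bucket records by Soundex key (computed once each).
    Type 1 pass: refine an oversized bucket with Levenshtein distance."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[soundex(rec["last"])].append(rec)

    matches = buckets.get(soundex(new_last), [])
    if len(matches) > max_results:
        # too many phonetic matches: fall back to edit distance on the bucket only
        matches = [r for r in matches
                   if levenshtein(r["last"].lower(), new_last.lower()) <= max_distance]
    return matches

In a real application the buckets would of course be a stored, indexed key column, not an in-memory dictionary rebuilt on every lookup.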

And a plain SQL query will find the exact duplicates already in the table:

SELECT *
FROM (SELECT count(*) AS amount, firstname, lastname, birthdate
      FROM datatable
      GROUP BY firstname, lastname, birthdate) AS a
WHERE amount > 1
ORDER BY amount DESC;