Remove duplicate records from large database

Craig_Grech · March 31, 2017, 3:06am

I’ve got an SQLite database (320,000 records) containing a few duplicate records. What’s the quickest way to remove them.

I’ve written a function that works but I gave up waiting for it to finish.
Basically, I created a new table with identical fields to hold only unique records.
I looped through main table and checked the new table for a record with a few of the most critical fields.
If the record doesn’t exist I add it to the new table.
If it does exist, write to a log file so I can see what duplicates were captured.
Any ideas on a faster way of doing it?

PS. one good thing about this method is I can stop the program at anytime and resume later. It flys through the records I’ve already done so I’m presuming the speed is lost in writing new records to the table.

Norman_P · March 31, 2017, 3:11am

write a query that finds the second instance of a row and records the row id of that and then nuke the second (or third and fourth) ones

you can find these with a select … group by … having

select primary_key_id from <table> group by <here you probably want all the columns that make a row a duplicate> having count(*) > 1

or something loosely like that should list all rows that have 2 or more versions with identical value in whatever columns you list in the group by

if you want to see all the columns as well then select more than just the primary key

Stan_F · March 31, 2017, 3:31am

DELETE FROM A FROM YourTableName A WHERE EXISTS(SELECT * FROM YourTableName B WHERE A.PrimaryKey<B.PrimaryKey AND A.DuplicateColumnName=B.DuplicateColumnName)

WARNING! Backup your data! Untested query!

To get faster results:

DELETE TOP (10) FROM A FROM YourTableName A WHERE EXISTS(SELECT * FROM YourTableName B WHERE A.PrimaryKey<B.PrimaryKey AND A.DuplicateColumnName=B.DuplicateColumnName)

Keep repeating until nothing is deleted.

Jeff_Tullin · March 31, 2017, 4:53am

delete from thetable where primarykey in (select min(primarykey) from table group by duplicatefield having count(1) > 1);

run that until it reports zero rows if there are triples or more.

Craig_Grech · March 31, 2017, 5:21am

Norman Palardy wrote:

Craig_Grech · April 2, 2017, 10:40pm

Selecting the duplicates worked fine (thanks Norman) but I was unable to delete them without a primary key.
The file is a csv import and doesn’t contain a primary key.
I thought about recreating the file but it takes such a long time to add all the records. That’s another thing I can’t find a solution to.

Instead I used SELECT rowid FROM table_main GROUP BY ... .
Then I read that using the rowid as a primary key isn’t wise and can lead to the wrong record being deleted. Is this true?

Norman_P · April 2, 2017, 10:54pm

IF you only use the rowid for deletion then ignore it for everything else you should be fine

Tim_Hare · April 2, 2017, 11:58pm

Only if you store it and then use it later. It may have changed in the interim. If you’re using it right away, it shouldn’t be a problem.

Norman_P · April 3, 2017, 1:11am

In sqlite the only time it recycles and possibly reuses is if you vacuum - which is kind of sucky but …
Otherwise they are stable
So if you just inserted all your data you case safely use the rowids to remove duplicates
And if these rowids are not used as foreign keys in other tables then it wont matter in any event
Thats when sqlite’s recycling of rowids causes issues since you insert a row and a related row in another table using the rowid and the IF the rowid changes your relationship is wrong

AND if you define an explicit INTEGER PRIMARY KEY for the table then the warning about ROWID’s doesn’t apply

Craig_Grech · April 3, 2017, 7:58pm

Thanks, my app was a one off so what is done is done now but next time I will add a primary key.
The good thing about XOJO is you can make something really quick. My boss wanted me to take a 320,000 record csv file and perform a heap of very explicit sorting operations on it to produce multiple filtered csv files. He had been messing with it for weeks using Excel and got absolutely nowhere.
My boss couldn’t give a rat’s about how I did it. If I try to explain anything technical he puts his fingers in his ears and says “La, la,la,la …” then walks off saying “just get it done”
I wrote a program in XOJO to convert it to an SQLite file, remove duplicates and do all the SQL queries, export the files back to csv and had everything his mailbox within 2 hours. No interface except a window with a button, one SQL string which I manually edit for each query, run the program in the IDE. The boss is pleased. XOJO saves the day.
And I couldn’t have achieved this without the help from the wonderful people on this forum.