I am having some trouble with de-duplicating an array. I used a method that I picked up from this forum from some years ago (Xojo Array: Remove duplicate elements?), and it is not removing all duplicates.
This is the method, which takes as its parameter a previously-sorted array (I’ve checked the sorting step and this works fine):
for i as Integer = 0 to theArray.Ubound-1
// compare the current member with the next member
// if they are the same, remove the current member
if theArray(i) = theArray(i + 1) then
theArray.Remove(i)
end if
next
When I check the results most of the duplicates are still there. I think it may be to do with the counting of array members. The method uses an incrementing integer i, and if I remove the i-th member, does the next member get assigned i as its position in the array? If not, does that mean that the counter “skips over” the next member?
Also, why doesn’t it give an error at the Ubound end of the array? If I have made the array smaller by removing members, the value of Ubound(theArray) will be wrong, won’t it?
All rather confusing. Can someone set me straight on this please?
Dim n as integer = ubound( theArray )
For l as integer = n downto 1
if theArray( i ) = theArray( i - 1 ) then
theArray.remove i
end if
next
Typed directly into the forum without testing, but it should give you an idea.
Because you’re calculating the loop end on each loop. Storing it into a variable and working backwards should prevent this and should improve the performance because it’s not having to count the number of items in each loop.
Think of it as a stack, you pull one out and the one above falls down into that position.
Edit: This also only works if the array is sorted so that the duplicates are next to each other. If they’re not, you can either sort the array or use a dictionary (with dictionary.hasKey) to see if you’ve already seen that item or one of the same value.
Yes, that is the problem. If you’re going to remove elements, you have to go from Ubound downto 0. Otherwise, you will skip elements of the array completely.
'filter if we have more than 1 entry
If count > 0 Then
'remember the last entry
previousValue = a(count)
'iterate from the second last entry down to the first entry
count = count - 1
For i = count DownTo 0
If a(i) = previousValue Then
'the current entry matches the previous entry so remove it
a.Remove(i)
Else
'the current entry does not match the previous entry so remember the current entry's values
previousValue = a(i)
End If
Next
Var theArray() as string
theArray.add("Ian")
theArray.add("Ian")
theArray.add("Jeannot")
theArray.add("Paul")
theArray.add("Geoff")
theArray.add("Greg")
theArray.add("Greg")
theArray.add("Greg")
theArray.add("Greg")
theArray.add("Jeannot")
theArray.add("Jeannot")
theArray.Sort
for i as Integer = 0 to theArray.Ubound-1
// compare the current member with the next member
// if they are the same, remove the current member
if theArray(i) = theArray(i + 1) then
theArray.RemoveAt(i)
i = i - 1
end if
next
break
This will work as well and helps probably understanding the challenge. But the better way is to iterate backward, like the other colleagues suggested.
Thanks everyone for the replies. I tried Sam’s (and Tim’s) method and that works perfectly. And in fact the original page that I referred to iterates down from the top of the array to the bottom. I naively thought that this was the same as going from the bottom to the top.
Well it is more or less the same, but not from an “index perspective” ;-), as you are deleting the duplicate rows. If you would buffer the ubound variable (or better user
theArray.LastRowIndex
as uBound is obsolete in API v2, you will see the effect in the debugger.
I wrote a Method to deduplicate that uses a memory-based SQLite database!
This allowed me to either remove EVERY duplicate that has a pair, or remove just the excess duplicates (keeping one). It also allowed me to keep the original order of the String items (not need to pre-sort) if the original order must be maintained.
I know it is overkill for the OP, but someone might find it useful in years to come!
Protected Function getRemoveDuplicatesWAD(SearchString As string, Delimiter As String, removeAllDuplicates As Boolean = False) as String
Var tempArray(-1) As String
Var tempSQLiteDatabase As New SQLiteDatabase
Var rs As RowSet
Var rec As New DatabaseRow
if not tempSQLiteDatabase.Connect then Return ""
tempSQLiteDatabase.ExecuteSQL("CREATE TABLE IF NOT EXISTS data(id INTEGER PRIMARY KEY, TextItem TEXT COLLATE NOCASE)")
'add every item to database
tempArray = SearchString.SplitBytes(Delimiter)
for tempInt As Integer = 0 to tempArray.LastRowIndex
rec = New DatabaseRow
rec.Column("TextItem").StringValue = tempArray(tempInt)
tempSQLiteDatabase.AddRow("data", rec)
next
'find the duplicates
if removeAllDuplicates then
tempSQLiteDatabase.ExecuteSQL("DELETE FROM data WHERE TextItem IN(SELECT TextItem FROM data GROUP BY TextItem HAVING COUNT(*) > 1) ") 'remove every item that's a duplicate
else
tempSQLiteDatabase.ExecuteSQL("DELETE FROM data WHERE NOT id IN(SELECT MIN(id) FROM data GROUP BY TextItem) ") 'remove duplicates, but keep the first one found in the list
end if
ReDim tempArray(-1)
Try
rs = tempSQLiteDatabase.SelectSQL("SELECT TextItem FROM data ORDER BY id ")
Catch Error
CommonWindow.doDisplayErrorWAD(Error.Message, CurrentMethodName, Error.ErrorNumber)
Return ""
End Try
If rs = Nil Then
Return ""
End If
while not rs.AfterLastRow
tempArray.AddRow(rs.Column("TextItem").StringValue)
rs.MoveToNextRow
wend
Return String.FromArray(tempArray, Delimiter)
End Function
About performance, this came up before many years ago on the mailing list and was tested in Real Studio. It was found that it really made no difference, so I decided to test again.
My code creates an integer array of 10M elements, then assigns its index to each element and calls a function. Example:
#if not DebugBuild
#pragma BackgroundTasks False
#pragma NilObjectChecking False
#pragma StackOverflowChecking False
#pragma BoundsChecking False
#endif
dim msg as string
dim sw as new Stopwatch_MTC
sw.Start
var arr( kLastRowIndex ) as integer
for i as integer = 0 to arr.LastRowIndex
arr( i ) = i
sw.Lap
next
sw.Stop
msg = "Last Row: " + sw.ElapsedMilliseconds.ToString( "#,##0.0##" ) + " ms, Avg: " + sw.AverageLapMicroseconds.ToString( "#,##0.0##" ) + "µs"
AddToResult msg
The results show there really isn’t much difference and, unless you are trying to squeeze out every last microsecond, you don’t need to bother.
The first line is when using the code above, the second is when caching LastRowIndex to a variable first.