De-duplicating array not working

Hi,

I am having some trouble with de-duplicating an array. I used a method that I picked up from this forum from some years ago (Xojo Array: Remove duplicate elements?), and it is not removing all duplicates.

This is the method, which takes as its parameter a previously-sorted array (I’ve checked the sorting step and this works fine):

  for i as Integer = 0 to theArray.Ubound-1
  // compare the current member with the next member
  // if they are the same, remove the current member
  if theArray(i) = theArray(i + 1) then
    theArray.Remove(i)
  end if
next

When I check the results most of the duplicates are still there. I think it may be to do with the counting of array members. The method uses an incrementing integer i, and if I remove the i-th member, does the next member get assigned i as its position in the array? If not, does that mean that the counter “skips over” the next member?

Also, why doesn’t it give an error at the Ubound end of the array? If I have made the array smaller by removing members, the value of Ubound(theArray) will be wrong, won’t it?

All rather confusing. Can someone set me straight on this please?

Ian.

Step backwards through the array.

Dim n as integer = ubound( theArray )
For l as integer = n downto 1
  if theArray( i ) = theArray( i - 1 ) then
    theArray.remove i
  end if
next

Typed directly into the forum without testing, but it should give you an idea.

Because you’re calculating the loop end on each loop. Storing it into a variable and working backwards should prevent this and should improve the performance because it’s not having to count the number of items in each loop.

Think of it as a stack, you pull one out and the one above falls down into that position.

Edit: This also only works if the array is sorted so that the duplicates are next to each other. If they’re not, you can either sort the array or use a dictionary (with dictionary.hasKey) to see if you’ve already seen that item or one of the same value.

Yes, that is the problem. If you’re going to remove elements, you have to go from Ubound downto 0. Otherwise, you will skip elements of the array completely.

Edit - Sam beat me to it.

2 Likes

The following sample code might help you.

Dim a(-1) As Int32
Dim previousValue As Int32
Dim count, i As Int32

a.Append(5)
a.Append(5)
a.Append(6)
a.Append(7)
a.Append(7)
a.Append(7)
a.Append(8)
a.Append(9)
a.Append(9)
a.Append(10)
a.Append(10)

count = UBound(a)

'filter if we have more than 1 entry
If count > 0 Then
'remember the last entry
previousValue = a(count)

'iterate from the second last entry down to the first entry
count = count - 1

For i = count DownTo 0
  If a(i) = previousValue Then
    'the current entry matches the previous entry so remove it
    a.Remove(i)
  Else
    'the current entry does not match the previous entry so remember the current entry's values
    previousValue = a(i)
  End If
Next

End If

Break

Var theArray() as string

theArray.add("Ian")
theArray.add("Ian")
theArray.add("Jeannot")
theArray.add("Paul")
theArray.add("Geoff")
theArray.add("Greg")
theArray.add("Greg")
theArray.add("Greg")
theArray.add("Greg")
theArray.add("Jeannot")
theArray.add("Jeannot")

theArray.Sort


for i as Integer = 0 to theArray.Ubound-1
  // compare the current member with the next member
  // if they are the same, remove the current member
  if theArray(i) = theArray(i + 1) then
    theArray.RemoveAt(i)
    i = i - 1
  end if
next

break

This will work as well and helps probably understanding the challenge. But the better way is to iterate backward, like the other colleagues suggested.

Thanks everyone for the replies. I tried Sam’s (and Tim’s) method and that works perfectly. And in fact the original page that I referred to iterates down from the top of the array to the bottom. I naively thought that this was the same as going from the bottom to the top.

Thanks for the quick replies everyone.

2 Likes

Well it is more or less the same, but not from an “index perspective” ;-), as you are deleting the duplicate rows. If you would buffer the ubound variable (or better user

theArray.LastRowIndex

as uBound is obsolete in API v2, you will see the effect in the debugger.

I wrote a Method to deduplicate that uses a memory-based SQLite database!

This allowed me to either remove EVERY duplicate that has a pair, or remove just the excess duplicates (keeping one). It also allowed me to keep the original order of the String items (not need to pre-sort) if the original order must be maintained.

I know it is overkill for the OP, but someone might find it useful in years to come!

Protected Function getRemoveDuplicatesWAD(SearchString As string, Delimiter As String, removeAllDuplicates As Boolean = False) as String
  Var tempArray(-1) As String
  Var tempSQLiteDatabase As New SQLiteDatabase
  Var rs As RowSet
  Var rec As New DatabaseRow
  
  if not tempSQLiteDatabase.Connect then Return ""
  tempSQLiteDatabase.ExecuteSQL("CREATE TABLE IF NOT EXISTS data(id INTEGER PRIMARY KEY, TextItem TEXT COLLATE NOCASE)")
  
  'add every item to database
  tempArray = SearchString.SplitBytes(Delimiter)
  for tempInt As Integer = 0 to tempArray.LastRowIndex
    rec = New DatabaseRow
    rec.Column("TextItem").StringValue = tempArray(tempInt)
    tempSQLiteDatabase.AddRow("data", rec)
  next
  
  'find the duplicates
  if removeAllDuplicates then
    tempSQLiteDatabase.ExecuteSQL("DELETE FROM data WHERE TextItem IN(SELECT TextItem FROM data GROUP BY TextItem HAVING COUNT(*) > 1) ") 'remove every item that's a duplicate
  else
    tempSQLiteDatabase.ExecuteSQL("DELETE FROM data WHERE NOT id IN(SELECT MIN(id) FROM data GROUP BY TextItem) ") 'remove duplicates, but keep the first one found in the list
  end if
  
  ReDim tempArray(-1)
  Try
    rs = tempSQLiteDatabase.SelectSQL("SELECT TextItem FROM data ORDER BY id ")
  Catch Error
    CommonWindow.doDisplayErrorWAD(Error.Message, CurrentMethodName, Error.ErrorNumber)
    Return ""
  End Try
  
  If rs = Nil Then
    Return ""
  End If
  
  while not rs.AfterLastRow
    tempArray.AddRow(rs.Column("TextItem").StringValue)
    
    rs.MoveToNextRow
  wend
  
  Return String.FromArray(tempArray, Delimiter)
    
End Function
2 Likes

Grab ArrayUtils from http://www.strout.net/info/coding/rb/intro.html. An oldie but goldie.

About performance, this came up before many years ago on the mailing list and was tested in Real Studio. It was found that it really made no difference, so I decided to test again.

My code creates an integer array of 10M elements, then assigns its index to each element and calls a function. Example:

#if not DebugBuild
  #pragma BackgroundTasks False
  #pragma NilObjectChecking False
  #pragma StackOverflowChecking False
  #pragma BoundsChecking False
#endif

dim msg as string
dim sw as new Stopwatch_MTC
sw.Start

var arr( kLastRowIndex ) as integer
for i as integer = 0 to arr.LastRowIndex
  arr( i ) = i
  sw.Lap
next

sw.Stop
msg = "Last Row: " + sw.ElapsedMilliseconds.ToString( "#,##0.0##" ) + " ms, Avg: " + sw.AverageLapMicroseconds.ToString( "#,##0.0##" ) + "µs"
AddToResult msg

The results show there really isn’t much difference and, unless you are trying to squeeze out every last microsecond, you don’t need to bother.

The first line is when using the code above, the second is when caching LastRowIndex to a variable first.

1 Like