Most Efficient Way To Calculate Difference Between 2 Arrays?

Kimball_Larsen · July 14, 2014, 5:49pm

In PHP there is a function array_diff that is useful for determining which elements in array 1 are not present in array 2. I’d love to be able to do the same thing in Xojo, but the best my little brain has come up with is to loop through array 1 and do an indexOf each element against array 2 to see if it exists in that array.

Is this the most efficient way to do it?

While intuitively it seems to have a O(n) runtime (where n is the number of elements in array 1) I wonder if there is a faster way to calculate the intersection or difference between 2 arrays. In my tests, the performance is not as fast as I’d like. I can live with it if I have to, but for large arrays (10k items or more) this causes a noticeable delay while it calculates.

Tim_Hare · July 14, 2014, 6:17pm

Are the arrays sorted in the same sequence?

Alwyn_Bester · July 14, 2014, 6:18pm

You could use a dictionary. Loop through the first array and add each value to the dictionary using the value as the key (and an arbitrary value for the value e.g. 1).

Then you simply loop through the second array in a single pass and use the Dictionary.HasKey(Value) with each value to verify it effectively exists in array 1. Not sure if this has any performance benefit though, but I think dictionaries uses hashes to do lookups which is usually very fast.

Kimball_Larsen · July 14, 2014, 6:30pm

Neither array is sorted in any particular order.

@Alwyn Bester I think that creating the dictionary and looping through 2 arrays will be slower than a single loop through 1 array with a lookup for each element into the second array.

Turns out that the slowest part of the whole thing is actually populating the arrays (they are filenames from items on disk) so my best optimization efforts will be spent looking at faster ways to get filenames from folderitems. I’ll start a new thread about that.

Michel_Bujardet · July 14, 2014, 6:34pm

Maybe ?

Join Array2 into Array2String
Dim FindPosition(-1)
For each item in Array1
if Instr(Array2String, Item) = 0 then append Item and its index in Array1 into FindPosition

At the end FindPosition contains all the items that are not in Array2 and their index in Array1

Tim_Hare · July 14, 2014, 6:38pm

I would expect a dictionary to be faster than IndexOf.

Paul_Lefebvre · July 14, 2014, 8:21pm

I think your double array lookup is more like O(n*n) or O(n^2), which is not good.

Using a Dictionary would be O(N) to populate the Dictionary and O(N) to lookup values in the Dictionary (since a Dictionary lookup is a fast hash table lookup, it is way faster than an array IndexOf, which essentially is just a loop through all the array elements). O(2N) is faster than O(N^2).

It’s been a while since I’ve thought about O notation, so I may have it all wrong, of course.

Kimball_Larsen · July 14, 2014, 8:32pm

I’m not sure I agree that my approach is much worse than O(n). Course, like you, it’s been years since I’ve really concerned myself with this stuff… but as I recall, you figure your bound by calculating how many times you have to visit or test an item. In my case, I test each item in only one of the arrays exactly one time, thus my runtime is O(n), where n is the number of items in one of the arrays.

However, this ignores the cost of the specific test for each item. Now, if the test must visit each item in array 2, then my runtime will be O(n^m), where n is the size of the first array, and m is the size of the second array. This is terrible. But I suspect that indexOf short circuits and returns a result as soon as it finds a matching value, so it only has to visit the elements in the array until it finds the match. That will be on average 50%ish of the array (assuming nearly all the values are present in both arrays, as is the case with my data), or so my addled brain tells me. So that would yield O(n^(1/2 * m)), which is still not good.

Others have suggested using a dictionary for this test, as lookups are faster due to the inherent hash tables in Dictionaries for their keys. However, this forgets that in order to do this, I have to take the 2 arrays of String() that I already have, and loop through them BOTH to create 2 new dictionaries, then do the lookups by looping through one of the dictionaries to do the hasKey() against the other dictionary for each item in the first.

Now, if I was starting with 2 dictionaries in the first place, this is the clear winning approach. I’m not sure that the overhead involved in creating 2 new dictionaries and populating them is low enough to still maintain a time savings with this approach.

I’m going to give it a whirl, though. My data set is roughly 3100 folderitem names, with maybe 1-10 or so not present in one of the arrays. I’ll post results shortly.

Kimball_Larsen · July 14, 2014, 8:46pm

Ok, I stand very corrected. Using a Dictionary instead of an array is VERY worth the overhead to set up the dictionary. I should note that only 1 dictionary is needed, not 2. Here is some code to show you what I’m doing:

//Fetch the UUIDs I care about from the db dim rs as RecordSet = DB.selectQuery("select * from Exam") dim f as FolderItem = app.mainWindowBusiness.getPathToThumbs() dim folders() as String = f.ItemNames() //uses Karen's code for VERY fast item name listing, as a String() dim foldersDict as new Dictionary //declare a new dictionary for my lookup for i as integer = 0 to ubound(folders()) //populate the dictionary with the strings in folders() foldersDict.value(folders(i)) = "" next //create an array to hold the UUIDs I don't find in the dictionary and need to be updated. dim UUIDsToUpdate() as String while rs <> nil and not rs.eof //walk through the recordSet and test against the dictionary dim uuid as String = rs.field("uuid").StringValue if not foldersDict.hasKey(uuid) then //found one that needs to be updated UUIDsToUpdate.append(uuid) end if rs.MoveNext wend

This whole process takes about 1/10th of a second to run on an old windows 7 laptop with 3100ish folder items. Before I started optimizations, it was taking around 45 seconds.

Thanks, everyone for your input.

Rick_Araujo · July 15, 2014, 4:05pm

If you need for any chance in the future to merge 2 lists and obtain the differences, here is one approach: https://drive.google.com/file/d/0B6w6JZen_vW2dTlTcVNEWlVsM2s/edit?usp=sharing

Stefan_Cyrus · July 18, 2014, 3:35pm

Here a solution without any dictionary.

For two arrays each with 1,000,000 integers (works also with strings) -> 0,333 sec.

link text