In PHP there is a function array_diff that is useful for determining which elements in array 1 are not present in array 2. I’d love to be able to do the same thing in Xojo, but the best my little brain has come up with is to loop through array 1 and do an indexOf each element against array 2 to see if it exists in that array.
Is this the most efficient way to do it?
While intuitively it seems to have a O(n) runtime (where n is the number of elements in array 1) I wonder if there is a faster way to calculate the intersection or difference between 2 arrays. In my tests, the performance is not as fast as I’d like. I can live with it if I have to, but for large arrays (10k items or more) this causes a noticeable delay while it calculates.
You could use a dictionary. Loop through the first array and add each value to the dictionary using the value as the key (and an arbitrary value for the value e.g. 1).
Then you simply loop through the second array in a single pass and use the Dictionary.HasKey(Value) with each value to verify it effectively exists in array 1. Not sure if this has any performance benefit though, but I think dictionaries uses hashes to do lookups which is usually very fast.
@Alwyn Bester I think that creating the dictionary and looping through 2 arrays will be slower than a single loop through 1 array with a lookup for each element into the second array.
Turns out that the slowest part of the whole thing is actually populating the arrays (they are filenames from items on disk) so my best optimization efforts will be spent looking at faster ways to get filenames from folderitems. I’ll start a new thread about that.
I think your double array lookup is more like O(n*n) or O(n^2), which is not good.
Using a Dictionary would be O(N) to populate the Dictionary and O(N) to lookup values in the Dictionary (since a Dictionary lookup is a fast hash table lookup, it is way faster than an array IndexOf, which essentially is just a loop through all the array elements). O(2N) is faster than O(N^2).
It’s been a while since I’ve thought about O notation, so I may have it all wrong, of course.
I’m not sure I agree that my approach is much worse than O(n). Course, like you, it’s been years since I’ve really concerned myself with this stuff… but as I recall, you figure your bound by calculating how many times you have to visit or test an item. In my case, I test each item in only one of the arrays exactly one time, thus my runtime is O(n), where n is the number of items in one of the arrays.
However, this ignores the cost of the specific test for each item. Now, if the test must visit each item in array 2, then my runtime will be O(n^m), where n is the size of the first array, and m is the size of the second array. This is terrible. But I suspect that indexOf short circuits and returns a result as soon as it finds a matching value, so it only has to visit the elements in the array until it finds the match. That will be on average 50%ish of the array (assuming nearly all the values are present in both arrays, as is the case with my data), or so my addled brain tells me. So that would yield O(n^(1/2 * m)), which is still not good.
Others have suggested using a dictionary for this test, as lookups are faster due to the inherent hash tables in Dictionaries for their keys. However, this forgets that in order to do this, I have to take the 2 arrays of String() that I already have, and loop through them BOTH to create 2 new dictionaries, then do the lookups by looping through one of the dictionaries to do the hasKey() against the other dictionary for each item in the first.
Now, if I was starting with 2 dictionaries in the first place, this is the clear winning approach. I’m not sure that the overhead involved in creating 2 new dictionaries and populating them is low enough to still maintain a time savings with this approach.
I’m going to give it a whirl, though. My data set is roughly 3100 folderitem names, with maybe 1-10 or so not present in one of the arrays. I’ll post results shortly.
Ok, I stand very corrected. Using a Dictionary instead of an array is VERY worth the overhead to set up the dictionary. I should note that only 1 dictionary is needed, not 2. Here is some code to show you what I’m doing:
//Fetch the UUIDs I care about from the db
dim rs as RecordSet = DB.selectQuery("select * from Exam")
dim f as FolderItem = app.mainWindowBusiness.getPathToThumbs()
dim folders() as String = f.ItemNames() //uses Karen's code for VERY fast item name listing, as a String()
dim foldersDict as new Dictionary //declare a new dictionary for my lookup
for i as integer = 0 to ubound(folders())
//populate the dictionary with the strings in folders()
foldersDict.value(folders(i)) = ""
next
//create an array to hold the UUIDs I don't find in the dictionary and need to be updated.
dim UUIDsToUpdate() as String
while rs <> nil and not rs.eof
//walk through the recordSet and test against the dictionary
dim uuid as String = rs.field("uuid").StringValue
if not foldersDict.hasKey(uuid) then
//found one that needs to be updated
UUIDsToUpdate.append(uuid)
end if
rs.MoveNext
wend
This whole process takes about 1/10th of a second to run on an old windows 7 laptop with 3100ish folder items. Before I started optimizations, it was taking around 45 seconds.