Missing "Generics" in Xojo

To manage large arrays of records, I’ve implemented my own version of the QuickSort algorithm.

  • The sort is happening based on the current machine locale (for example on Cocoa with CFStringCompare).
  • The sort can happen on a maximum of 3 keys ( = columns).
  • Each key can either sort ascending or descending independent of the two other keys.
  • Sorting is stable (this is done by adding the initial array position as a fourth key).
  • All common data types are supported (Boolean, Integer, Int64, Single, Double, Currency, Date, String).
  • MemoryBlocks and Objects can be used as column data types, though they can’t be used as sort keys.
  • Each column supports Nil values, which - dependent on a user preferences checkbox - will either be sorted at the beginning or end of the values in the array.
  • The sort performance is independent of the number of columns a record has, since an internal, separate index array is sorted only.

I’ve tweaked everything, so that 25000 records are sorted in approx. 1 second in the worst case on a 2.4 Core i7 (QuickSort on pre-sorted arrays are slow). I have an older iMac from 2008, where it takes up to 1,7 seconds in the worst case. So I’m quite satisfied with the performance.

The main gain on speed was by replacing Variants with the specific data types in the array. This was done by changing from row-based arrays to column-based arrays. The Nil values are actually flags, for example a Nil in a Integer column is either -2,147,483,648 or 2,147,483,647 (depending where the user wants the Nil values to appear). For string values &u0 and &u100EF are used.

The other change was to create typed QuickSort and Compare methods.

Due to the missing “generics” capability, I now have to write 8 x 8 x 8 = 512 QuickSort methods and 512 Compare methods. Does somebody have any ideas, as unconventional they might be, how to circumvent that - without loosing too much speed?

As I understand this, the reason you need 8x8x8 is that you have 8 data types and 3 columns, and I presume that your code sorts all 3 columns at once (e.g. you are creating a pseudo key with the value Column1+Column2+Column3 )

What about simply handling a 3 column sort in 3 steps: sort by column 1, then by column 2, then by column 3. For the 2nd and 3rd sorts you’d have to keep track of the prior pre-sorted key so that you only sort within values of that key that are the same.

Performance would be only 1/3 as fast in the worst-case scenario, which is not great. However, you could also extend this technique to any number K of columns to sort by.

I tried something like what you’re proposing, but this was either slower than using Variants (the case where you only need one QuickSort and one Compare method), or it got extremely complicated to just sub-sort key2 values on equal key1 values, etc.

I’m now going for a reduced set of QuickSort and Compare methods, one for each combination of data types, plus an additional order flag. As an example:

Compare(value11 As String, value12 As String, descending1 As Boolean, value21 As Integer, value22 As Integer, descending2 As Boolean, value31 As Integer, value32 As Integer, descending3 As Boolean, order As Integer)
This compares a string column and two integer column. The “order” argument decides in which order they are compared. So the above method covers for:

  • String, Integer, Integer (first key, second key, third key)
  • Integer, String, Integer
  • Integer, Integer, String

Still not perfect, but much less code to write.

I bet you’d get a bigger speed increase if you inline your compare function.

That being said, you can reduce it from 8 data types to fewer by using some automatic type conversion rules, e.g.:

  • any integer type (including Currency) will fit within an Int64 - so you can have simply use Int64s for all integer comparisons
  • Singles will fit within Doubles, so Doubles can handle both types

With these changes you are down to 5 types: (Boolean, Int64, Double, Date, String)

I will go with inlining the compare function, that’s a really good idea - I was staring at my code for hours and didn’t no once think of this.

A little off topic, but also look up “middle of three” for better performance when the data is already sorted or mostly sorted.

Using a stable sort algorithm would make it easier to do since you could sort by key1, then key2, then key3
http://en.wikipedia.org/wiki/Category:Stable_sorts
Quicksort isn’t so sorting by multiple keys requires special care

Which raises the question: is the default SortWith stable?

Good question - and I honestly don’t know but I don’t think so
Looks like it uses QuickSort

Radix is a fun one and pretty darn quick in it own right & IS stable
Doing it generically can be a challenge BUT if you know the data your sorting can be readily written