Fastest case sensitive equality test

I can think of three ways of testing for equality between strings with case sensitivity. They are using the String.Compare function, an encoding such as EncodeHex, or a hashing such as SHA256. (This is because a Xojo string remembers case but doesn’t operate on them with case sensitivity.) The EncodeHex method seems best, with the following results:

6.372 seconds for 100,000 String.Compare
0.0220001 seconds for 100,000 HexEncoding comparisons
1.605 seconds for 100,000 Hashing comparisons
0.01 seconds for 100,000 case-insensitive string comparisons

These results were obtained after running the following code:

dim strTest1 as string = "Test"
dim strTest2 as string = "Test" 
dim intRounds as integer
dim strMessage as string

dim dblSeconds as double  =xojo.core.date.Now.SecondsFrom1970
do until intRounds = 10000
  intRounds = intRounds + 1
  if strTest1.Compare(strTest2, ComparisonOptions.CaseSensitive) <> 0 then
    break
  end if
loop
strMessage = str(xojo.core.date.Now.SecondsFrom1970 - dblSeconds) + " seconds for 100,000 String.Compare" 


dblSeconds =xojo.core.date.Now.SecondsFrom1970
intRounds = 0
do until intRounds = 10000
  intRounds = intRounds + 1
  If EncodeHex(strTest1) <> EncodeHex(strTest2) then
    break
  end if
loop

strMessage = strMessage + chr(13) + str(xojo.core.date.Now.SecondsFrom1970 - dblSeconds) + " seconds for 100,000 HexEncoding comparisons" 

dblSeconds =xojo.core.date.Now.SecondsFrom1970
intRounds = 0
do until intRounds = 10000
  intRounds = intRounds + 1
  If crypto.SHA512(strTest1) <> crypto.SHA512(strTest2) then
    break
  end if
loop
strMessage = strMessage + chr(13) + str(xojo.core.date.Now.SecondsFrom1970 - dblSeconds) + " seconds for 100,000 Hashing comparisons" 

dblSeconds =xojo.core.date.Now.SecondsFrom1970
intRounds = 0
do until intRounds = 10000
  intRounds = intRounds + 1
  If strTest1 <> strTest2 then
    break
  end if
loop
strMessage = strMessage + chr(13) + str(xojo.core.date.Now.SecondsFrom1970 - dblSeconds) + " seconds for 100,000 case-insensitive string comparisons"

TextArea1.Text = strMessage

The question is why?

EncodeHex ignores string encoding and operates directly on the bytes of the string. String.Compare has to evaluate the bytes in light of the encoding, so it will take longer. I’m a little surprised by your results of a straight string comparison. Seems like it shouldn’t be the fastest.

I guess the standard (case insensitive) equality test is optimised byte for byte where the encodings are the same.

Using EncodeHex with Extends also allows you to add the encoding to a string more easily to produce more readable code in the usual manner like this which cannot be done with String.Compare:

If strTest1.EncodeAsHex = strTest2.EncodeAsHex then

The Module’s method extends strings like this:

Function EncodeAsHex(Extends str As String) As string
  return  EncodeHex(str)
End Function

(Although for some reason, in our copy of 2020r3.2 I had to toggle the Private and Public Scopes in the Method’s menu before it was recognised in other parts of the project.)

What platforms?
You could also use NSStringCompareMBS function on macOS and iOS to compare with various options.

what time will be two to lower case compare?

The output says 100,000, but the code uses 10,000. You’re sure you did the same number of comparisons for each test?

(The code above says you did, just making sure.)

Also, was this in the ide or compiled?

I just tested here compiled and get similar results.

Faster than Hex:

var mb1 as MemoryBlock = s1
var mb2 as MemoryBlock = s2
if mb1 <> mb2 then ...

My results:

0.0237651 seconds for 10,000 String.Compare
0.0039558 seconds for 10,000 HexEncoding comparisons
0.6085248 seconds for 10,000 Hashing comparisons
0.0000231 seconds for 10,000 case-insensitive string comparisons
0.003612 seconds for 10,000 MemoryBlock comparisons

I thought so too, but a quick test showed that Compare does not consider encoding.

I converted one of the strings to UTF-16 and String.Compare no longer showed them as equal.

Here’s my attempt:

  • built in 2020R2.1
  • 64 bit macOS
  • compiled app
  • using #pragma DisableBackgroundTasks
  • run on a core i9

Conclusions:

  • String.Compare() is very slow
  • String.Compare(CaseSensitive) is no faster than CaseInsensitive!
1 Like

I like the way you presented that, very useful.

I didn’t even think to check StrComp as I thought it would be about the same as Compare. This could use an explanation from the engineers as it what it’s doing differently.

Did you use Aggressive compile?

Checking it in Instruments (which is a fantastic tool, by the way) it looks like most of the time is being wasted doing a Runtime Stack Check inside the TextEncoding.OperatorCompare function:

Edit to add: Correction, about 60% of the time is wasted in the stack check, but even with that removed, RuntimeCompareTextWithOptions still looks rather slow…

Default. Since the time is being wasted inside Xojo framework calls, I don’t think that the optimization level will matter much.

Here is the same test using Aggressive:

My conclusions: API2 string operations are not really ready for prime time yet:

  • ComparisonOptions.CaseInsensitive is broken
  • ComparisonOptions.CaseSensitive is more than 100x slower than StrComp(binary)

That’s a pretty interesting topic. Within my App I have a routine which checks the letters of every single String for Upper-/Lowercase to prepare them for another process. I have the following function:

Private Function IsLowercase(value As String) As Boolean
  Var iResult As Integer = value.Compare(value.Lowercase, ComparisonOptions.CaseSensitive)
  Return iResult = 0
End Function

Yeah, this function will use some time for large String to proceed each single character within a loop. Since it looks like String.Compare is not well optimized in this topic, do you have another suggestion to speed it up by not using String.Compare?

Thanks

Ah great. So StrComp was replaced with String.Compare because it’s prettier, and we get a much slower implementation. Fantastic…

4 Likes

What do you do with a letter identified as one case or another?

I need this check to enable a certain graphical output where a source String is then drawn character by character.

P.S. I don’t want to use a RegEx for this :wink: .

1 Like