Fastest case sensitive equality test

Kem_Tekinay · February 11, 2021, 5:22pm

Off the top of my head:

var chars() as string = s.Split( "" )
var upperChars() as string = s.Uppercase.Split( "" )

for index as integer = 0 to chars.LastIndex
  if chars( index ).Asc = upperChars( index ).Asc then
    // uppercase
  else
    // lowercase
  end if
next

Martin_T · February 11, 2021, 5:30pm

Thanks @Kem_Tekinay. Interesting, using

Private Function IsLowercase(value As String) As Boolean
  Return value.Asc = value.Lowercase.Asc
End Function

instead of

Private Function IsLowercase(value As String) As Boolean
  Var iResult As Integer = value.Compare(value.Lowercase, ComparisonOptions.CaseSensitive)
  Return iResult = 0
End Function

makes a difference of 20 ms in my test case. Any further suggestions for optimization?

Kem_Tekinay · February 11, 2021, 5:43pm

Note that s.Asc will return the code point for the first character only. Is that what you want?

Are you measuring the calling function too? If so, are you using Split to parse the characters?

Martin_T · February 11, 2021, 5:51pm

It doesn’t matter in my case, because I call the function inside a loop for single characters:

For Each char As String In source.Characters
  If IsLowerCase(char) Then ...
Next

That’s why it works wonderfully in my case. I have also tested it with Cyrillic letters etc.

Kem_Tekinay · February 11, 2021, 5:55pm

The only optimization I see is to inline the test as function calls have overhead.

Edit: But that affects readability.

Martin_T · February 11, 2021, 5:57pm

Cool, thanks. However, this doesn’t make sense in my case, because I call the function in other methods as well and don’t want duplicate code.

Kem_Tekinay · February 11, 2021, 6:02pm

I agree.

But in thinking about it more, where speed is important, you are calling the Lowercase function for each character of the string instead of just once on the whole thing. If your string is 100 characters, it probably doesn’t make a difference. If it’s 100k, you might feel that.

Kem_Tekinay · February 11, 2021, 6:11pm

In other words, instead of converting each character to lowercase before the comparison, convert the string to lowercase, then compare each character to the corresponding character of the original.
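A rough sketch of that idea, reusing the Split approach from my first post. The helper name LowercaseFlags is made up for illustration, and it assumes that lowercasing does not change the number of characters in the string:

```xojo
Private Function LowercaseFlags(source As String) As Boolean()
  // Hypothetical helper: returns one Boolean per character of source,
  // True where the character is unchanged by lowercasing.
  Var chars() As String = source.Split("")
  Var lowerChars() As String = source.Lowercase.Split("") // Lowercase is called once

  Var flags() As Boolean
  // Assumption: the lowercase string has the same number of characters.
  // If it does not, give up and return an empty array.
  If lowerChars.LastIndex <> chars.LastIndex Then Return flags

  flags.ResizeTo(chars.LastIndex)
  For index As Integer = 0 To chars.LastIndex
    // Each element is a single character, so comparing Asc values is enough
    flags(index) = (chars(index).Asc = lowerChars(index).Asc)
  Next
  Return flags
End Function
```

As with the per-character version, digits and punctuation also come out as True here, since lowercasing leaves them unchanged.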

anon93744516 · May 9, 2021, 12:53am

After reading the “Leaking Locale and core.Local Objects?” thread, I started thinking about string comparison performance and found the current thread.

And after reading through here, I was interested in how NSStringCompareMBS would perform if included as another test case (based on the code in the original post).

So I added the following two test runs:

dblSeconds = Xojo.Core.Date.Now.SecondsFrom1970
intRounds = 0
Do Until intRounds = 10000
  intRounds = intRounds + 1
  If NSStringCompareMBS(strTest1, strTest2, 0) <> 0 Then
    Break
  End If
Loop
strMessage = strMessage + Chr(13) + Str(Xojo.Core.Date.Now.SecondsFrom1970 - dblSeconds) + " seconds for 10,000 NSStringCompareMBS case-sensitive string comparisons"

dblSeconds = Xojo.Core.Date.Now.SecondsFrom1970
intRounds = 0
Do Until intRounds = 10000
  intRounds = intRounds + 1
  If NSStringCompareMBS(strTest1, strTest2, 1) <> 0 Then
    Break
  End If
Loop
strMessage = strMessage + Chr(13) + Str(Xojo.Core.Date.Now.SecondsFrom1970 - dblSeconds) + " seconds for 10,000 NSStringCompareMBS case-insensitive string comparisons"

And got the following results:

0.0206921 seconds for 10,000 String.Compare
0.0049689 seconds for 10,000 HexEncoding comparisons
0.5892110 seconds for 10,000 Hashing comparisons
0.0012398 seconds for 10,000 case-insensitive string comparisons
0.0022290 seconds for 10,000 NSStringCompareMBS case-sensitive string comparisons
0.0021710 seconds for 10,000 NSStringCompareMBS case-insensitive string comparisons

Note: I included the results of all tests, because I’m using Xojo 2021r1.1 on a 2018 Mac Mini (10.15.7) with 3.2 GHz 6-Core Intel Core i7 & 32 GB RAM.

My conclusion was, for case-insensitive string comparisons use the = or <> operators. And for case-sensitive matches, use NSStringCompareMBS - if available to you and appropriate.

I hope that is useful to someone. Thanks.

Sam_Rowlands · May 9, 2021, 2:29am

There must be a way to use MemoryBlocks efficiently for this, although I can’t think of one at the moment. I use MemoryBlocks for case-sensitive “Select Case”.

Markus_Winter · May 9, 2021, 4:49am

If you have to use MemoryBlocks for standard operations then Xojo is doing something wrong.

Christian_Schmitz · May 9, 2021, 7:57am

Thanks for the comparison.

For String.Compare I made feedback case 64647, as it converts all Strings to Text and then does a compare, which makes it slower than it needs to be. And on macOS the compare may create a CFString internally (another copy in addition to the Text) to do the compare.

For NSStringCompareMBS you similarly have the overhead of a plugin function call, which is not as efficient as it could be (see 62010). And then our plugin will do the CFString/NSString comparison for you.

JensK · May 9, 2021, 7:57am

String.Compare may be slower than other methods, but in one project it was the only way for me to get a decent ordering of a ListBox containing “Umlaute”. (StrComp didn’t work properly.) I don’t know how else I could have done it, so I am glad I have this option.

Kem_Tekinay · May 9, 2021, 11:31am

Unicode normalization issue?

JensK · May 9, 2021, 11:45am

I fear I don’t know what that is.

Beatrix_Willius · May 9, 2021, 11:47am

You will learn the first time when strings that should be the same don’t match. I thought I was going crazy.

JensK · May 9, 2021, 11:52am

I will look into this as soon as it bugs me … (Fortunately I didn’t have to deal with that sort of problem before.)

Kem_Tekinay · May 9, 2021, 12:04pm

In a nutshell, some characters can be represented in two different ways: one code point that represents the character, or a series of two or more code points.

To see this in action in Xojo, try this:

MessageBox "e" + &u0300

(I am greatly simplifying the issue here.)

Normalization is the process of getting all the characters represented in the same way, either Composed (one code point) or Decomposed (two or more code points). Once you have achieved that consistency, things like sorts and searches will work properly.

My M_String project includes normalization code, as do the MBS plugins.
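To make the two representations visible, here is a small sketch that counts code points. CodepointCount is a made-up helper name, and the second part is meant to be pasted into something like a button's event handler. It only shows the difference; it does not normalize anything:

```xojo
Private Function CodepointCount(s As String) As Integer
  // Hypothetical helper: counts the Unicode code points in s
  Var count As Integer
  For Each codepoint As UInt32 In s.Codepoints
    count = count + 1
  Next
  Return count
End Function

// Elsewhere, e.g. in an event handler:
Var composed As String = &u00E8         // è as a single code point
Var decomposed As String = "e" + &u0300 // e followed by a combining grave accent

// Both strings display as the same character, but the counts differ
MessageBox(composed + ": " + Str(CodepointCount(composed)) + ", " + _
  decomposed + ": " + Str(CodepointCount(decomposed)))
```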

http://www.mactechnologies.com/index.php?dowloads

Christian_Schmitz · May 9, 2021, 4:59pm

That is why there are flags like

const NSDiacriticInsensitiveSearch = 128
const NSWidthInsensitiveSearch = 256
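A minimal sketch of how such flags might be combined, assuming NSStringCompareMBS takes the options as an integer third parameter, as in the test code earlier in this thread (where 1 was used for case-insensitive). NSCaseInsensitiveSearch = 1 is the value from Apple's NSStringCompareOptions; the sample strings are just an illustration:

```xojo
// Option values from Apple's NSStringCompareOptions
Const NSCaseInsensitiveSearch = 1
Const NSDiacriticInsensitiveSearch = 128
Const NSWidthInsensitiveSearch = 256

// Combine flags with a bitwise Or
Var options As Integer = NSCaseInsensitiveSearch Or NSDiacriticInsensitiveSearch

// Requires the MBS plugin on macOS; 0 means the strings compare as equal
If NSStringCompareMBS("Überall", "uberall", options) = 0 Then
  MessageBox("Equal when ignoring case and diacritics")
End If
```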