Performance bug in macOS ARM builds?

GarryPettet · December 5, 2020, 10:57pm

So I’m one of the lucky ones who’ve managed to get an M1 MacBook Pro. As most of the reviews say, it’s great. I recompiled a few of my apps in 2020R2 for Apple Silicon and see up to a 2x performance increase compared to my 2017 MacBook Pro.

My joy was short-lived though, as I think I’ve found an issue in the Xojo framework that is causing performance to be worse on ARM than on Intel.

Put this code in the Open event of a new app:

Const iterations = 20000

Var a(iterations - 1) As String

Var start As Double = System.Microseconds
For i As Integer = 0 To iterations - 1
  a.Insert(i, "a")
Next i
Var total As Integer = (System.Microseconds - start) / 1000

MessageBox("Took " + total.ToString + " ms")

If I run (or build, it doesn’t matter) this app on my M1 Mac as x86 it takes 90 ms. If I run or build it for ARM on macOS it takes 450 ms. If I build it as a universal binary it takes 90 ms (the same as the Intel build).

This doesn’t seem right. Why would this super simple app run slower on an M1 Mac when compiled for ARM than when running through Rosetta as x86?

Can anyone verify this?

Here’s the example Xojo project:

Sam_Rowlands · December 5, 2020, 11:13pm

Memory interactions are up to 5x faster on ARM (so I’ve read), so this should be faster too.

If you pre-allocate the array to the size of iterations and simply set the value at each index, this would be a lot faster on both Intel and ARM. However, that might defeat the purpose of this test.

Martin_T · December 5, 2020, 11:25pm

This is much faster.

Const iterations = 20000

Var a(iterations - 1) As String
Var iMax As Integer = iterations - 1

Var start As Double = System.Microseconds
For i As Integer = 0 To iMax
  a(i) = "a"
Next i
Var total As Integer = (System.Microseconds - start) / 1000

MessageBox("Took " + total.ToString + " ms")

GarryPettet · December 5, 2020, 11:26pm

Oh yeah, I get that, Sam. I’ve just been distilling the issue down to a reproducible bug.

From what I can tell, this should be faster on an M1 Mac when compiled for ARM than when run on the same M1 Mac but compiled for Intel.

Maybe there’s a bug in the compiler??

GarryPettet · December 5, 2020, 11:29pm

Just to clarify, people: I’m not looking for a faster algorithm. This is just demonstrating an issue.

This code should be faster when compiled for ARM than when compiled for Intel, should it not?

Martin_T · December 5, 2020, 11:31pm

What happens if you run the customized code under M1 now?

GarryPettet · December 5, 2020, 11:38pm

OK @Martin_T, that reverses things. I upped the iterations by a factor of 10 (because it’s so fast) but this code:

Const iterations = 200000

Var a(iterations - 1) As String
Var iMax As Integer = iterations - 1

Var start As Double = System.Microseconds
For i As Integer = 0 To iMax
  a(i) = "a"
Next i
Var total As Integer = (System.Microseconds - start) / 1000

MessageBox("Took " + total.ToString + " ms")

Gives 25 ms on ARM and 35 ms on Intel.

Maybe the bug is in Array.Insert()?

Martin_T · December 5, 2020, 11:41pm

Could be.

By the way: you are aware that by using Array.Insert you grow your array, which was declared with 200000 entries, to 400000 entries, right? Is this what you want?
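A minimal sketch of that growth, using a 5-element array purely for illustration (the size and the message text are assumptions, not taken from the original test). Each Insert adds a new element instead of overwriting an existing one, so the element count doubles:

```xojo
// Illustrative only: a tiny array stands in for the 200000-entry one.
Var a(4) As String // indices 0 to 4, so 5 elements to begin with

For i As Integer = 0 To 4
  a.Insert(i, "a") // adds a new element at index i; nothing is overwritten
Next i

// 5 original elements + 5 inserted elements
Var n As Integer = a.Count
MessageBox("Count after inserts: " + n.ToString) // shows 10
```

Assigning with a(i) = "a" instead leaves the count at 5.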

I got these results with 10000000 entries on a 2017 MacBook Pro. That’s amazing.

Sam_Rowlands · December 5, 2020, 11:45pm

Can you use Activity Monitor and confirm that your M1 is running the ARM version?
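As a cross-check from inside the app, something like the following sketch could work. It assumes Xojo's compile-time constants TargetARM and TargetX86; because they are resolved at build time, each slice of a universal binary reports the architecture it was compiled for. The message strings are made up for illustration.

```xojo
// Sketch: report which build of the app is actually executing.
// TargetARM / TargetX86 are evaluated at compile time, so this tells you
// which slice is running, not what CPU the machine has.
#If TargetARM Then
  MessageBox("Running the ARM build")
#ElseIf TargetX86 Then
  MessageBox("Running the Intel (x86) build")
#Else
  MessageBox("Running on an unrecognised architecture")
#EndIf
```

On an M1 Mac, seeing the Intel message would mean the app is running through Rosetta.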

GarryPettet · December 6, 2020, 12:04am

Ha, no it wasn’t. Good spot.

I actually have no interest in using arrays in this way in the code I’m writing. I’ve just written a GapBuffer class and was going to benchmark it against using an array. I wrote the above code quickly to get an idea how long it would take using (what I assumed would be) the slower array method and then test my new class.

It was in writing this quick code that I discovered a difference in performance between ARM and Intel and thought, huh - that’s weird.

It doesn’t detract from the fact that it should always be faster when running natively than when running through Rosetta, unless there is some bug in the framework or compiler, I guess.

This bug gets weirder and weirder. There’s definitely something smelly going on with Array.Insert() on ARM.

If you run this code on both ARM and Intel:

Const iterations = 50000
Var iMax As Integer = iterations - 1

// Test inserting into an empty / small array.
Var a() As String
Var start1 As Double = System.Microseconds
For i As Integer = 0 To iMax
  a.Insert(i, "a")
Next i
Var total1 As Integer = (System.Microseconds - start1) / 1000

Var message As String = "Inserting into small array: " + total1.ToString + " ms" + EndOfLine

// Test inserting into an array that has already been allocated space.
Var b(iterations - 1) As String
Var start2 As Double = System.Microseconds
For i As Integer = 0 To iMax
  b.Insert(i, "a")
Next i
Var total2 As Integer = (System.Microseconds - start2) / 1000

message = message + "Inserting into large array:  " + total2.ToString + " ms"

MessageBox(message)

For array b, it takes 550 ms on x86 and 2625 ms on ARM. For array a, ARM is slightly faster than x86 (7 ms vs 10 ms).
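Some gap between a and b is expected on any architecture. The sketch below counts how many existing elements sit at or after the insertion index on each call. It assumes Insert has to move those elements up by one slot, as a contiguous array would; that is an assumption about the framework's implementation, not something confirmed here. The variable names are hypothetical.

```xojo
// Rough model of the work each test does, under the assumption above.
Const iterations = 50000

Var countA As Int64 = 0          // a starts empty
Var countB As Int64 = iterations // b starts with 50000 empty strings
Var shiftsA As Int64 = 0
Var shiftsB As Int64 = 0

For i As Integer = 0 To iterations - 1
  shiftsA = shiftsA + (countA - i) // inserting at the end: 0 elements to move
  shiftsB = shiftsB + (countB - i) // 50000 elements to move on every insert
  countA = countA + 1
  countB = countB + 1
Next i

// Under this model: shiftsA = 0, shiftsB = 50000 * 50000 = 2500000000
MessageBox("a: " + shiftsA.ToString + " moves, b: " + shiftsB.ToString + " moves")
```

That would account for b being far slower than a, but not for ARM being roughly 5x slower than x86 on the same workload.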

What is going on here??

GarryPettet · December 6, 2020, 12:06am

Yep, confirmed.

GarryPettet · December 6, 2020, 9:07am

Bug filed: <https://xojo.com/issue/62970>

DerkJ · December 6, 2020, 9:14am

Did you try the API 2.0 function, AddAt?

Sam_Rowlands · December 6, 2020, 9:31am

I forgot, you have the OAK, there’s code in there (check the System Information window in the demo app) that will tell you what architecture your computer is using (as well as what architecture your computer is).

GarryPettet · December 6, 2020, 9:47am

Good thought. Using the code below makes no difference:

// Test inserting into an array that has already been allocated space using `AddAt`.
Var c(iterations - 1) As String
Var start3 As Double = System.Microseconds
For i As Integer = 0 To iMax
  c.AddAt(i, "a")
Next i
Var total3 As Integer = (System.Microseconds - start3) / 1000

GarryPettet · December 6, 2020, 9:51am

Screenshot 2020-12-06 at 09.50.50

ChristopheDV · December 6, 2020, 9:53am

This may also be related to slower FolderItem calls (see other thread) on ARM.

Please be aware that for Intel there are a lot more compiler optimisations available (SSE, SIMD, MMX,…).
ARM only has NEON (if I am not mistaken). Maybe Xojo is not compiling to make use of those ARM optimisations?

ChristopheDV · December 6, 2020, 10:08am

You may get an answer from a Xojo dev to optimise your code.
Anyhow, I guess Xojo Inc needs to optimise ARM compiling for sure.

GarryPettet · December 6, 2020, 10:09am

My code doesn’t access the file system though.

Interesting. I hadn’t considered that option. If that is the case then there’s little incentive to build a universal binary at the moment if Rosetta can translate more efficient x86 code faster than the M1 can run Xojo’s un-optimised ARM code.

ChristopheDV · December 6, 2020, 10:26am

I know, but the lack of ARM optimisation could also be affecting FolderItem.