I guess we’ll find out.
Very good work. I’m confused though - why are any locks required at all?
Wouldn’t it be possible to completely segregate each thread like this:
- Thread 1 : reads first 1/N bytes of data file into its own memory block, processes it
- Thread 2 : reads second 1/N bytes… processes it
- Thread N : reads final 1/N bytes… processes it
Each thread then does a call to Thread.AddUserInterfaceUpdate, passing a Dictionary of its results. Each dictionary will have 413 entries. The final step is just merging these, which will take basically zero time and doesn’t need to be done on a thread at all (see the sketch below).
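For what it’s worth, that final merge could look something like this. It’s only a minimal illustration of the idea, assuming each worker produced its own Dictionary mapping station name to a hypothetical StationStats class (with Min, Max and Sum As Double, Count As Integer); none of these names come from the actual project:

```xojo
' Merge the per-thread result Dictionaries into one. Runs on the main
' thread after all workers have finished, so no locking is needed.
Function MergeResults(perThread() As Dictionary) As Dictionary
  Var merged As New Dictionary
  For Each d As Dictionary In perThread
    For Each entry As DictionaryEntry In d
      Var name As String = entry.Key
      Var s As StationStats = entry.Value
      If merged.HasKey(name) Then
        Var m As StationStats = merged.Value(name)
        m.Min = Min(m.Min, s.Min)
        m.Max = Max(m.Max, s.Max)
        m.Sum = m.Sum + s.Sum
        m.Count = m.Count + s.Count
      Else
        merged.Value(name) = s
      End If
    Next
  Next
  Return merged ' 413 entries, same as each input Dictionary
End Function
```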
The official list specifies a certain Intel processor and allows at most 8 threads. I’m guessing that what we are seeing uses more than 8 threads and may be running on an Apple Silicon CPU, which would be cheating (a little bit).
Still very exciting to see!
These are probably locks in the Xojo framework when dealing with memory management.
I looked at Kem’s code (the Optimization_KemTekinay_MT module) and it looks like it is using Semaphores to coordinate across threads, e.g.
```xojo
RowsSemaphore.Signal ' acquire the semaphore; blocks until it's available
self.Rows = self.Rows + nRows ' shared counter, safe only inside the lock
logProgress "Reading row " + Str(self.Rows)
RowsSemaphore.Release ' release so the next waiting thread can proceed
```
While I believe that this specific 1BRC algorithm could be done without that cross-thread synchronization, a major goal of this exercise is exercising the new Xojo Preemptive threads, so it’s great that Kem is doing that!
The rules limit us to eight threads. And the locks are internal to the framework.
Looking at @Kem_Tekinay’s latest code, I see some nice improvements over my last (md_5) version:
- I used a function to parse the temperature data; Kem’s version has inlined it, so no function call is needed.
- I used a separate hash function; Kem has inlined it, and it looks like a faster algorithm(?)
- my custom Dictionary uses open addressing with linear probing; his uses separate chaining, with an array per bucket. The difference is “what do you do when two station names hash to the same bucket index?” (sketch after this list)

In open addressing with linear probing:
- keep checking the next bucket until you find an open one

In separate chaining:
- each bucket is actually a collection (some algorithms use a linked list, but Kem’s just uses an array), so you simply add the item to that bucket’s collection
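To make the difference concrete, here’s a minimal sketch of a linear-probing lookup (the strategy my Dictionary uses). The names, the power-of-two table size, and the use of "" as the empty marker are my assumptions, not code from either project:

```xojo
' Find the bucket index for key in a linear-probing hash table.
' Assumes keys() is pre-sized to a power of two and never completely full,
' with "" marking an empty bucket.
Function FindSlot(keys() As String, key As String, hash As Integer) As Integer
  Var mask As Integer = keys.LastIndex ' size - 1 is a valid mask for power-of-two sizes
  Var i As Integer = hash And mask
  While keys(i) <> "" And keys(i) <> key
    i = (i + 1) And mask ' collision: step to the next bucket, wrapping around
  Wend
  Return i ' either the key's existing bucket or the first open one
End Function
```

With separate chaining, a collision would instead simply append the new item to the array held in that bucket.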
You can’t read from the same shared source (such as a file) from multiple threads at the same time, or some crash or side effect will arise; some synchronization, a waiting queue for example, must exist.
You can hardly do any task that shares a resource without using some locking. The operation must be “atomic”, i.e. incapable of being interrupted, with control passing to another thread, before the entire consistent unit of work has completed.
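For example (a sketch of my own, not from any of the posted projects): without a lock, two threads can both read the old value of a shared counter and one update gets lost. Xojo’s CriticalSection makes the read-modify-write atomic. Assume Lock (a CriticalSection) and Rows (an Integer) are properties shared by the worker threads:

```xojo
Sub AddRows(nRows As Integer)
  Lock.Enter ' only one thread at a time may pass this point
  Rows = Rows + nRows ' this read-modify-write can no longer interleave
  Lock.Leave ' let the next waiting thread in
End Sub
```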
My hunch is that with read-only access to a file, the OS handles this for us.
For example there’s zero problem with multiple different Processes reading from the same file.
I would assume that it’s true if multiple Threads within the same process are doing the same. But I would also assume that you need to use separate objects, e.g. each thread needs to open the file itself and create its own BinaryStream, rather than trying to share one BinaryStream across threads.
As I say, it’s a hunch, so I may be wrong.
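If the hunch is right, a per-thread reader could look roughly like this. A minimal sketch, assuming the file splits into equal chunks; handling of lines that straddle a chunk boundary is omitted, and all the names are mine:

```xojo
' Each worker opens the file itself, so no seek position is shared.
Sub ReadMyChunk(f As FolderItem, index As Integer, threadCount As Integer)
  Var chunkSize As Int64 = f.Length \ threadCount
  Var bs As BinaryStream = BinaryStream.Open(f) ' read-only by default
  bs.Position = index * chunkSize
  Var data As String = bs.Read(chunkSize)
  bs.Close
  ' ... parse data into this thread's own private Dictionary ...
End Sub
```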
It handles sending you error conditions (exceptions) when what you asked it to do ends up in some broken state.
That’s one possible abstraction, based on isolation: each thread owns a “complex structure” that takes care of the read-only resource, so no “current seek position” overlaps, for example, would occur. Opening the same file from multiple places within the same process is a possibility that depends on the capabilities of the underlying OS. I guess most, if not all, allow it, depending only on the mode set at a low level.
I’m testing out my own version using Preemptive threading. In my version, I’m using zero Semaphores or CriticalSections while the data is being processed (saving that for the very final step).
I’m basically seeing the same as Kem: my 8-thread version, which should be 8 x 14 = 112x as fast as baseline, is running at about 1x baseline, which means each thread is running at roughly 1% of the speed it should be (one one-hundredth).
I’m also seeing a second problem where the Main thread is not sleeping properly, which suggests that https://tracker.xojo.com/xojoinc/xojo/-/issues/77567 is not fixed (or perhaps I’ve found a new way to trigger it).
Keep in mind that that was just one iteration of what I tried when I thought the issue was my code and not the framework.
I reported the Thread.SleepCurrent() issue here: https://tracker.xojo.com/xojoinc/xojo/-/issues/77691
Also ran into another bug; this one is a crasher that appears to be a mutex problem within the framework with Preemptive threads. I can reproduce it even if I’m only running 1 thread and not using any Semaphores or CriticalSections in my own code: https://tracker.xojo.com/xojoinc/xojo/-/issues/77696
It is a real treat to read your reports, but there is something missing: the woman’s point of view.
Won’t any women chime in and show they can do it too (and do it as well as anyone, if not better)?
C’mon, ladies…
Now that 2024R4 is out, let’s see how fast it is:
Tests on Intel i9:

Xojo 2024 R3.1

| Speedup | Version | Throughput |
| --- | --- | --- |
| 1.0x | Baseline | 328,000 rows/second |
| 16.5x | MD_6 | 6,370,000 rows/second |
| 0.8x | MD_6 MT (8 preemptive threads) | 329,000 rows/second |

### Multithreaded is much, much slower in 2024R3.1!

Xojo 2024 R4

| Speedup | Version | Throughput |
| --- | --- | --- |
| 1.0x | Baseline | 425,000 rows/second |
| 12.5x | MD_6 | 5,314,000 rows/second |
| 60.4x | MD_6 MT (8 preemptive threads) | 25,727,000 rows/second |
Testing on my new M4 Pro:

Xojo 2024 R3.1 - M4 Pro (10/4) - Apple Silicon

| Speedup | Version | Throughput |
| --- | --- | --- |
| 1.0x | Baseline | 1,115,000 rows/second (3.5x faster for Arm vs. Intel) |
| 12.2x | MD_6 | 14,000,000 rows/second |
| 1.2x | MD_6 MT (8 preemptive threads) | 1,323,000 rows/second |

### Multithreaded is much, much slower in 2024R3.1!

Xojo 2024 R4 - M4 Pro (10/4) - Apple Silicon

| Speedup | Version | Throughput |
| --- | --- | --- |
| 1.0x | Baseline | 1,253,000 rows/second |
| 13.0x | MD_6 | 16,290,000 rows/second |
| 55.3x | MD_6 MT (8 preemptive threads) | 69,233,000 rows/second |
Conclusions:
- 2024 R4 fixes some major performance issues in preemptive threading vs. 2024 R3.1
- The M4 is fast: over 2.5x as fast on a per-core basis as the Intel i9
I returned to this after not having looked for a few months and started my own project from scratch. I’ll make that available on GitHub soon.
Limiting the number of threads to 8, I get these results:
```
Parsed 100,000,000 rows in 1.211s - 82,573,257.96 rows per sec
Gathered data for 413 rows in 0.004 s
Output in 0.012 s
```
This scales, so with 15 threads (of my 16 cores) I get:
```
Parsed 100,000,000 rows in 0.753s - 132,865,866.92 rows per sec
Gathered data for 413 rows in 0.005 s
Output in 0.012 s
```
BTW, although I didn’t consult the existing project, I did use some things I learned from @Mike_D.
```
$ ./OneBillion/OneBillion
Parsed 1,000,000,000 rows in 8.698s - 114,969,946.98 rows per sec
Gathered data for 413 rows in 0.006 s
Output in 0.014 s

$ ./OneBillion/OneBillion --jobs=8
Parsed 1,000,000,000 rows in 13.618s - 73,431,626.44 rows per sec
Gathered data for 413 rows in 0.005 s
Output in 0.013 s
```