I guess we’ll find out.
Very good work. I’m confused though - why are any locks required at all?
Wouldn’t it be possible to completely segregate each thread like this:
- Thread 1 : reads first 1/N bytes of data file into its own memory block, processes it
- Thread 2 : reads second 1/N bytes… processes it
- Thread N : reads final 1/N bytes… processes it
Each thread then does a call to Thread.AddUserInterfaceUpdate, passing a Dictionary of its results. Each dictionary will have 413 entries. The final step is just merging these, which will take basically zero time and doesn’t need to be done on a thread at all (see the sketch below).
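For what it’s worth, that final merge could look something like this. It’s only a minimal illustration of the idea, assuming each worker produced its own Dictionary mapping station name to a hypothetical StationStats class (with Min, Max and Sum As Double, Count As Integer); none of these names come from the actual project:

```xojo
' Merge the per-thread result Dictionaries into one. Runs on the main
' thread after all workers have finished, so no locking is needed.
Function MergeResults(perThread() As Dictionary) As Dictionary
  Var merged As New Dictionary
  For Each d As Dictionary In perThread
    For Each entry As DictionaryEntry In d
      Var name As String = entry.Key
      Var s As StationStats = entry.Value
      If merged.HasKey(name) Then
        Var m As StationStats = merged.Value(name)
        m.Min = Min(m.Min, s.Min)
        m.Max = Max(m.Max, s.Max)
        m.Sum = m.Sum + s.Sum
        m.Count = m.Count + s.Count
      Else
        merged.Value(name) = s
      End If
    Next
  Next
  Return merged ' 413 entries, same as each input Dictionary
End Function
```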
The official list specifies a certain Intel processor and allows at most 8 threads. I’m guessing that what we are seeing uses more than 8 threads and may be running on an Apple Silicon CPU, which would be cheating (a little bit).
Still very exciting to see!
These are probably locks in the Xojo framework when dealing with memory management.
I looked at Kem’s code (the Optimization_KemTekinay_MT module) and it looks like it is using Semaphores to coordinate across threads, e.g.
```xojo
RowsSemaphore.Signal ' acquire the semaphore; blocks until it's available
self.Rows = self.Rows + nRows ' shared counter, safe only inside the lock
logProgress "Reading row " + Str(self.Rows)
RowsSemaphore.Release ' release so the next waiting thread can proceed
```
While I believe that this specific 1BRC algorithm could be done without that cross-thread synchronization, a major goal of this exercise is exercising the new Xojo Preemptive threads, so it’s great that Kem is doing that!
The rules limit us to eight threads. And the locks are internal to the framework.
Looking at @Kem_Tekinay’s latest code, I see some nice improvements over my last (md_5) version:
- I used a function to parse the temperature data; Kem’s version has inlined it, so no function call is needed.
- I used a separate hash function; Kem has inlined it, and it looks like a faster algorithm(?)
- my custom Dictionary uses open addressing with linear probing; his uses separate chaining, with an array per bucket. The difference is “what do you do when two station names hash to the same bucket index?” (sketch after this list)

In open addressing with linear probing:
- keep checking the next bucket until you find an open one

In separate chaining:
- each bucket is actually a collection (some algorithms use a linked list, but Kem’s just uses an array), so you simply add the item to that bucket’s collection
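To make the difference concrete, here’s a minimal sketch of a linear-probing lookup (the strategy my Dictionary uses). The names, the power-of-two table size, and the use of "" as the empty marker are my assumptions, not code from either project:

```xojo
' Find the bucket index for key in a linear-probing hash table.
' Assumes keys() is pre-sized to a power of two and never completely full,
' with "" marking an empty bucket.
Function FindSlot(keys() As String, key As String, hash As Integer) As Integer
  Var mask As Integer = keys.LastIndex ' size - 1 is a valid mask for power-of-two sizes
  Var i As Integer = hash And mask
  While keys(i) <> "" And keys(i) <> key
    i = (i + 1) And mask ' collision: step to the next bucket, wrapping around
  Wend
  Return i ' either the key's existing bucket or the first open one
End Function
```

With separate chaining, a collision would instead simply append the new item to the array held in that bucket.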
You can’t read from the same shared source (such as a file) from multiple threads at the same time, or some crash or side effect will arise; some synchronization, a waiting queue for example, must exist.
You can hardly do any task that shares a resource without using some locking. The operation must be “atomic”, i.e. incapable of being interrupted, with control passing to another thread, before the entire consistent unit of work has completed.
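For example (a sketch of my own, not from any of the posted projects): without a lock, two threads can both read the old value of a shared counter and one update gets lost. Xojo’s CriticalSection makes the read-modify-write atomic. Assume Lock (a CriticalSection) and Rows (an Integer) are properties shared by the worker threads:

```xojo
Sub AddRows(nRows As Integer)
  Lock.Enter ' only one thread at a time may pass this point
  Rows = Rows + nRows ' this read-modify-write can no longer interleave
  Lock.Leave ' let the next waiting thread in
End Sub
```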
My hunch is that with read-only access to a file, the OS handles this for us.
For example there’s zero problem with multiple different Processes reading from the same file.
I would assume that it’s true if multiple Threads within the same process are doing the same. But I would also assume that you need to use separate objects, e.g. each thread needs to open the file itself and create its own BinaryStream, rather than trying to share one BinaryStream across threads.
As I say, it’s a hunch, so I may be wrong.
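If the hunch is right, a per-thread reader could look roughly like this. A minimal sketch, assuming the file splits into equal chunks; handling of lines that straddle a chunk boundary is omitted, and all the names are mine:

```xojo
' Each worker opens the file itself, so no seek position is shared.
Sub ReadMyChunk(f As FolderItem, index As Integer, threadCount As Integer)
  Var chunkSize As Int64 = f.Length \ threadCount
  Var bs As BinaryStream = BinaryStream.Open(f) ' read-only by default
  bs.Position = index * chunkSize
  Var data As String = bs.Read(chunkSize)
  bs.Close
  ' ... parse data into this thread's own private Dictionary ...
End Sub
```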
It handles sending you error conditions (exceptions) when what you asked it to do ends up in some broken state.
That’s one possible abstraction, based on isolation: each thread owns a “complex structure” that takes care of the read-only resource, so no “current seek position” overlaps, for example, would occur. Opening the same file from multiple places within the same process is a possibility that depends on the capabilities of the underlying OS. I guess most, if not all, allow it, depending only on the mode set at a low level.
I’m testing out my own version using Preemptive threading. In my version, I’m using zero Semaphores or CriticalSections while the data is being processed (saving that for the very final step).
I’m basically seeing the same as Kem: my 8-thread version, which should be 8 x 14 = 112x as fast as baseline, is running at about 1x baseline, which means each thread is running at roughly 1% of the speed it should be (one one-hundredth).
I’m also seeing a second problem where the Main thread is not sleeping properly, which suggests that https://tracker.xojo.com/xojoinc/xojo/-/issues/77567 is not fixed (or perhaps I’ve found a new way to trigger it).
Keep in mind that that was just one iteration of what I tried when I thought the issue was my code and not the framework.
I reported the Thread.SleepCurrent() issue here: https://tracker.xojo.com/xojoinc/xojo/-/issues/77691
Also ran into another bug; this one is a crasher that appears to be a mutex problem within the framework with Preemptive threads. I can reproduce it even if I’m only running 1 thread and not using any Semaphores or CriticalSections in my own code: https://tracker.xojo.com/xojoinc/xojo/-/issues/77696
It is a real treat to read your reports, but there is something missing: the woman’s point of view.
Won’t any women chime in and show they can do it too (and do it as well as anyone, if not better)?
C’mon, ladies…
Now that 2024R4 is out, let’s see how fast it is:
Tests on Intel i9:

Xojo 2024 R3.1

| Speedup | Version | Throughput |
| --- | --- | --- |
| 1.0x | Baseline | 328,000 rows/second |
| 16.5x | MD_6 | 6,370,000 rows/second |
| 0.8x | MD_6 MT (8 preemptive threads) | 329,000 rows/second |

### Multithreaded is much, much slower in 2024R3.1!

Xojo 2024 R4

| Speedup | Version | Throughput |
| --- | --- | --- |
| 1.0x | Baseline | 425,000 rows/second |
| 12.5x | MD_6 | 5,314,000 rows/second |
| 60.4x | MD_6 MT (8 preemptive threads) | 25,727,000 rows/second |
Testing on my new M4 Pro:

Xojo 2024 R3.1 - M4 Pro (10/4) - Apple Silicon

| Speedup | Version | Throughput |
| --- | --- | --- |
| 1.0x | Baseline | 1,115,000 rows/second (3.5x faster for Arm vs. Intel) |
| 12.2x | MD_6 | 14,000,000 rows/second |
| 1.2x | MD_6 MT (8 preemptive threads) | 1,323,000 rows/second |

### Multithreaded is much, much slower in 2024R3.1!

Xojo 2024 R4 - M4 Pro (10/4) - Apple Silicon

| Speedup | Version | Throughput |
| --- | --- | --- |
| 1.0x | Baseline | 1,253,000 rows/second |
| 13.0x | MD_6 | 16,290,000 rows/second |
| 55.3x | MD_6 MT (8 preemptive threads) | 69,233,000 rows/second |
Conclusions:
- 2024 R4 fixes some major performance issues in preemptive threading vs. 2024 R3.1
- The M4 is fast: over 2.5x as fast on a per-core basis as the Intel i9
I returned to this after not having looked for a few months and started my own project from scratch. I’ll make that available on GitHub soon.
Limiting the number of threads to 8, I get these results:
```
Parsed 100,000,000 rows in 1.211s - 82,573,257.96 rows per sec
Gathered data for 413 rows in 0.004 s
Output in 0.012 s
```
This scales, so with 15 threads (of my 16 cores) I get:
```
Parsed 100,000,000 rows in 0.753s - 132,865,866.92 rows per sec
Gathered data for 413 rows in 0.005 s
Output in 0.012 s
```
BTW, although I didn’t consult the existing project, I did use some things I learned from @Mike_D.
```
$ ./OneBillion/OneBillion
Parsed 1,000,000,000 rows in 8.698s - 114,969,946.98 rows per sec
Gathered data for 413 rows in 0.006 s
Output in 0.014 s

$ ./OneBillion/OneBillion --jobs=8
Parsed 1,000,000,000 rows in 13.618s - 73,431,626.44 rows per sec
Gathered data for 413 rows in 0.005 s
Output in 0.013 s
```