Xojo1BRC (Xojo One Billion Row Challenge)

I think this problem is really amenable to parallelization: you can divide the data into N chunks and process each chunk in parallel, completely independently.

Combining the N chunk results at the end is trivial (just merge the min, max, and count values and add the TempSums before computing averages).
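A minimal sketch of that merge step (assuming each chunk yields a Dictionary of hypothetical StationStats objects with Min, Max, TempSum, and Count properties — those names are mine, not from the project):

```xojo
' Merge the per-chunk dictionaries into one combined Dictionary.
' StationStats is a hypothetical class with Min, Max, TempSum, Count.
For Each chunkResult As Dictionary In chunkResults
  For Each entry As DictionaryEntry In chunkResult
    Var name As String = entry.Key
    Var stats As StationStats = StationStats(entry.Value)
    If combined.HasKey(name) Then
      Var c As StationStats = StationStats(combined.Value(name))
      c.Min = Min(c.Min, stats.Min)
      c.Max = Max(c.Max, stats.Max)
      c.TempSum = c.TempSum + stats.TempSum
      c.Count = c.Count + stats.Count
    Else
      combined.Value(name) = stats
    End If
  Next
Next
' Averages are computed once at the very end: TempSum / Count per station.
```

Because each chunk's dictionary is tiny (a few hundred stations), this merge is negligible next to the parsing time.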

I do wonder if this would be faster using Preemptive threads vs. Workers?

I haven't tested, but I can't imagine a scenario where Worker would be faster because of the overhead of transferring data to and fro, and converting strings to classes.

Not to mention that the Worker class is not available in console apps.

You're going to want a really fast printer! :grinning:

FWIW, my code above runs a little faster here than your v4. The baseline runs slower in your project, but I haven't looked to see why.

I see an odd slowdown running my v4 in the IDE, but in a built app it's about 4x as fast. I'm on Intel, too. Are you testing a built app?

Yes, only built apps using Default optimization. This is on an M3 Max.

I've been using Moderate optimization all this time. Maybe Moderate is faster than Default?

However, I just realized that the v4 I uploaded above was mistakenly left set to Aggressive. Oops! Too late to edit my post.

Here it is with the proper Moderate setting:
[Edit: removed, I think it has a bug]

In the IDE debug run, my results are confusing:

_md.1 is 1.3x faster than baseline
_md.2 is 0.4x faster than baseline
_md.3 is 0.4x faster than baseline
_md.4 is 0.3x faster than baseline  // the most optimized is the slowest!

but in the built app, I get this:

_md.1 is 1.2x faster than baseline
_md.2 is 1.6x faster than baseline
_md.3 is 2.2x faster than baseline
_md.4 is 4.5x faster than baseline  // Woo hoo!

As I mentioned, the baseline in your project is slower than in the one I started with. Can you download my project and test it? I'm interested in comparing the number of rows per second.

Here, yours is:

Optimization_md_4.Process
Optimization_md_4.ProcessIntoDictionary
Parsing file 'input.txt'
Reading row 10000000
Parsed 10000000 rows in 4.2 seconds, rate = 2,364,934 rows/second
Calculating 413 stations
Output file = 'output.txt'
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 4.2 seconds, rate = 2356642 rows/second
Speedup : 2.7x faster than Baseline

(I added the commas to your output manually.)

Mine is:

Optimization_KemTekinay_1.Process
Optimization_KemTekinay_1.ProcessIntoDictionary
Parsing file 'input.txt'
Reading row 10000000
Parsed 10000000 rows in 3.9 seconds, rate = 2,562,859 rows/second
Calculating 413 stations
Output file = 'output.txt'
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 3.9 seconds, rate = 2561891 rows/second
Speedup : 1.9x faster than Baseline

Kem, it looks like you may have edited the baseline? Comparing my Baseline to the one in your project, I see that in several methods most of the variable declarations have been moved out of the for loop in your version, while in mine they are inside the for loop (which is probably slower).

I did not, but I started with (I think) Martin's project, and those are some of the improvements he mentioned.

So we may have a slight "baseline" disagreement of 10% or so, but I don't think that's a big deal.

Here's my version 5, which is now giving me about a 5x speedup (Intel, Moderate compiler optimization):

Xojo1BRC_md_5.zip (20.3 KB)

Note: this version places the data file for the compiled app up one folder level (like _v1 did). Otherwise, each time you build, the IDE deletes your data and it needs to be regenerated, which is not fun for testing.

Also, the output file is appended to by each module that you run, so don't be surprised if your output has the station names listed five times if you run all five modules.

v5 Results (Intel)
----------------------------------------
Xojo 1BRC (One Billion Row Challenge)
Xojo1BRC version 1.0.0.5 macOS Intel x64
----------------------------------------
Baseline.Process
                                           
Processed 10000000 rows in 24.7 seconds, rate = 405053 rows/second
----------------------------------------
Optimization_md_1.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 18.2 seconds, rate = 550613 rows/second
Speedup : 1.4x faster than Baseline
----------------------------------------
Optimization_md_2.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 14.5 seconds, rate = 688495 rows/second
Speedup : 1.7x faster than Baseline
----------------------------------------
Optimization_md_3.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 10.4 seconds, rate = 957865 rows/second
Speedup : 2.4x faster than Baseline
----------------------------------------
Optimization_md_4.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 5.2 seconds, rate = 1916958 rows/second
Speedup : 4.7x faster than Baseline
----------------------------------------
Optimization_md_5.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 5 seconds, rate = 2005811 rows/second
Speedup : 5.0x faster than Baseline
----------------------------------------

Testing on an M1 MacBook Air, the speedup is more modest (3.8x), but since ARM is faster overall, it still beats the Intel handily (2.8M vs. 2.0M rows/second).

v5 Results (ARM)
----------------------------------------
Xojo 1BRC (One Billion Row Challenge)
Xojo1BRC version 1.0.0.5 macOS ARM x64
----------------------------------------
Baseline.CreateMeasurements
Creating Data File
Creating 10000000 measurements for 413 stations
Data file = input.txt

Created 10000000 rows in 112.7 seconds, rate = 88729 rows/second
----------------------------------------
Baseline.Process
                                           
Processed 10000000 rows in 13.6 seconds, rate = 735915 rows/second
----------------------------------------
Optimization_md_1.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 10.6 seconds, rate = 940925 rows/second
Speedup : 1.3x faster than Baseline
----------------------------------------
Optimization_md_2.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 9.4 seconds, rate = 1060558 rows/second
Speedup : 1.4x faster than Baseline
----------------------------------------
Optimization_md_3.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 6.9 seconds, rate = 1448954 rows/second
Speedup : 2.0x faster than Baseline
----------------------------------------
Optimization_md_4.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 3.8 seconds, rate = 2607634 rows/second
Speedup : 3.5x faster than Baseline
----------------------------------------
Optimization_md_5.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 3.6 seconds, rate = 2768279 rows/second
Speedup : 3.8x faster than Baseline
----------------------------------------

I finally got Aggressive compilation to work (by doing an ARM-only build on my Intel Mac), and the numbers are impressive. While the baseline is only slightly faster with Aggressive compilation, the v5 optimization runs at a 13x speedup, over 10 million rows per second.

v5 Results (ARM, Aggressive)
----------------------------------------
Xojo 1BRC (One Billion Row Challenge)
Xojo1BRC version 1.0.0.5 macOS ARM x64
----------------------------------------
Baseline.Process
                                           
Processed 10000000 rows in 12.7 seconds, rate = 786493 rows/second
----------------------------------------
Optimization_md_1.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 10.2 seconds, rate = 982043 rows/second
Speedup : 1.2x faster than Baseline
----------------------------------------
Optimization_md_2.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 4.7 seconds, rate = 2148359 rows/second
Speedup : 2.7x faster than Baseline
----------------------------------------
Optimization_md_3.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 1.6 seconds, rate = 6146309 rows/second
Speedup : 7.8x faster than Baseline
----------------------------------------
Optimization_md_4.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 1 seconds, rate = 9745427 rows/second
Speedup : 12.4x faster than Baseline
----------------------------------------
Optimization_md_5.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 1 seconds, rate = 10193808 rows/second
Speedup : 13.0x faster than Baseline

And here's Intel with Aggressive - a bit slower overall, but the Aggressive + v5 speedup is even higher at 14.8x.

v5 Results (Intel x64, Aggressive)
----------------------------------------
Xojo 1BRC (One Billion Row Challenge)
Xojo1BRC version 1.0.0.5 macOS Intel x64
----------------------------------------
Baseline.Process
                                           
Processed 10000000 rows in 21.6 seconds, rate = 463761 rows/second
----------------------------------------
Optimization_md_1.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 17.4 seconds, rate = 573600 rows/second
Speedup : 1.2x faster than Baseline
----------------------------------------
Optimization_md_2.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 7.5 seconds, rate = 1331340 rows/second
Speedup : 2.9x faster than Baseline
----------------------------------------
Optimization_md_3.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 2.8 seconds, rate = 3587713 rows/second
Speedup : 7.7x faster than Baseline
----------------------------------------
Optimization_md_4.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 1.6 seconds, rate = 6399756 rows/second
Speedup : 13.8x faster than Baseline
----------------------------------------
Optimization_md_5.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 1.5 seconds, rate = 6869046 rows/second
Speedup : 14.8x faster than Baseline
----------------------------------------
I haven't tried it, but you could probably get even better results by getting rid of the If statement that checks to make sure you're in a name when you hit a semicolon. If statements are expensive.

In Default optimization, your results are similar to mine. I am compiling yours as Aggressive now and will do mine next to compare. But a very nice job in any case.

I'll use my code to do a multi-core version unless I find yours is significantly faster in Aggressive mode.

I just tried various ways of creating a better lookup mechanism than using a Dictionary when starting with a MemoryBlock and could not find one. My best efforts took around 6 ms whereas using mb.StringValue + a Dictionary.Lookup/Value took 0 to 1 ms.
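For context, the fast path being compared looks roughly like this (a sketch only; `mb`, `nameStart`, `nameLen`, `stationDict`, and `StationStats` are placeholder names for whatever the parser actually has in hand, not the project's identifiers):

```xojo
' Extract the station name from the MemoryBlock, then use the
' native Dictionary for the lookup. StationStats is hypothetical.
Var name As String = mb.StringValue(nameStart, nameLen)

Var stats As StationStats
If stationDict.HasKey(name) Then
  stats = StationStats(stationDict.Value(name))
Else
  stats = New StationStats
  stationDict.Value(name) = stats
End If
' stats can now be updated with the parsed temperature.
```

The surprise is that the StringValue extraction plus the native hash lookup is hard to beat with a hand-rolled structure, at least in my attempts.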

@Mike_D , have you tested your replacement Dictionary class against mb.StringValue + the native Dictionary?

Bravo @Mike_D . With Aggressive, I can replicate your results here (10M/s), while my code tops out at around 3M/s.

@Mike_D , using some of your ideas, I was able to get to almost 14M rows/sec. (Yours is about 12.3M rows/sec.)

I also cleaned up the code, making the Baseline faster using @Martin_T's version, and added formatting codes to make the output more readable.

Results
----------------------------------------
Xojo 1BRC (One Billion Row Challenge)
Xojo1BRC version 1.0.0.5 macOS ARM x64
----------------------------------------
 verbose debugging enabled
Baseline.Process
Parsing file 'input.txt'
Reading row 10000000
Parsed 10,000,000 rows in 7.1 seconds, rate = 1,412,110 rows/second
Calculating 413 stations
Output file = 'output.txt'
Calculated 10,000,000 rows from 413 stations in 0.1 seconds, rate = 84,611,168 rows/second

Processed 10,000,000 rows in 7.2 seconds, rate = 1,388,420 rows/second
----------------------------------------
Optimization_KemTekinay_1.Process
Optimization_KemTekinay_1.ProcessIntoDictionary
Parsing file 'input.txt'
Reading row 10000000
Parsed 10,000,000 rows in 0.7 seconds, rate = 13,911,876 rows/second
Calculating 413 stations
Output file = 'output.txt'
Calculated 10,000,000 rows from 413 stations
Processed 10,000,000 rows in 0.7 seconds, rate = 13,868,548 rows/second
Speedup : 10.0x faster than Baseline
----------------------------------------
Optimization_md_5.Process
Optimization_md_5.ProcessIntoDictionary
Parsing file 'input.txt'
Reading row 10000000
Parsed 10,000,000 rows in 0.8 seconds, rate = 12,345,221 rows/second
Calculating 413 stations
Output file = 'output.txt'
Calculated 10,000,000 rows from 413 stations
Processed 10,000,000 rows in 0.8 seconds, rate = 12,121,126 rows/second
Speedup : 8.7x faster than Baseline

Have a look.

Xojo1BRC_md_5_Kem.xojo_binary_project.zip (23.9 KB)

The next step is multi-core.

An update…

I converted my code to multi-core and was, unfortunately, very disappointed. No matter what I tried, it was slow. And I don't mean slower than expected, I mean absurdly slow, like slower than the Baseline.

So I reached out to @William_Yu and filed this Issue. As it was explained to me, the current versions of Xojo use "synchronous locks" that are cross-platform but (very) expensive, so the results I saw are what you'd see in 2024r3.1.

But the good news is, in practically no time at all, William (and Travis) explored, identified, and fixed the issue by switching to "platform-specific locks". (I don't actually know the difference.) And that change, provided no problems come up in testing, will appear in a future release.

The result? My previous code was processing at about 14M rows per second when compiled through 2024r3. I'll let you be the judge of this.

Baseline.Process
Parsing file 'input.txt'
Reading row 10000000
Parsed 10,000,000 rows in 6.3 seconds, rate = 1,578,969 rows/second
Calculating 413 stations
Output file = 'output.txt'
Calculated 10,000,000 rows from 413 stations in 0.1 seconds, rate = 83,420,693 rows/second

Processed 10,000,000 rows in 6.5 seconds, rate = 1,549,494 rows/second
----------------------------------------
Optimization_KemTekinay_1.Process
Optimization_KemTekinay_1.ProcessIntoDictionary
Parsing file 'input.txt'
Reading row 10000000
Parsed 10,000,000 rows in 0.6 seconds, rate = 16,423,869 rows/second
Avg per chunk: 42.5 ms
Calculating 413 stations
Output file = 'output.txt'
Calculated 10,000,000 rows from 413 stations
Processed 10,000,000 rows in 0.6 seconds, rate = 16,360,311 rows/second
Speedup : 10.6x faster than Baseline
----------------------------------------
Optimization_KemTekinay_MT.Process
Optimization_KemTekinay_MT.ProcessIntoDictionary
Parsing file 'input.txt'
Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 43.0 ms
Reading row 729638Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 44.4 ms
Reading row 1459081Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 42.4 ms
Reading row 2188758Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 42.6 ms
Reading row 2918201Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 45.8 ms
Reading row 3647526Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 44.0 ms
Reading row 4377034Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 41.3 ms
Reading row 5106755Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 49.2 ms
Reading row 5836465Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 30.0 ms
Reading row 6352286Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 42.0 ms
Reading row 7081819Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 42.0 ms
Reading row 7811335Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 42.9 ms
Reading row 8541130Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 44.4 ms
Reading row 9270562Oct 24 12:05:40  Xojo1BRC[63660] <Warning>: Process chunk: 44.3 ms
Reading row 10000000
Parsed 10,000,000 rows in 0.1 seconds, rate = 94,795,752 rows/second
Calculating 413 stations
Output file = 'output.txt'
Calculated 10,000,000 rows from 413 stations
Processed 10,000,000 rows in 0.1 seconds, rate = 93,441,352 rows/second
Speedup : 60.3x faster than Baseline
----------------------------------------

And before anyone asks, no, I don't have access to the Xojo version used to compile this project, so I cannot do any further testing at this time.

But boy am I excited about it.

If that rate transferred linearly, Xojo could make it into the top ten of the current list…
I'd love to see the real numbers!

Xojo can process 1,000,000,000 rows in under 13 seconds?