Xojo1BRC (Xojo One Billion Row Challenge)

Mike_D · October 13, 2024, 4:37pm

Now that 2024R3 is out and supports Preemptive threading, I thought it might be fun to try our skill at the 1BRC (One Billion Row Challenge).

From the official 1BRC.Dev site

Your mission, should you choose to accept it, is to write a program that retrieves temperature measurement values from a text file and calculates the min, mean, and max temperature per weather station. There’s just one caveat: the file has 1,000,000,000 rows! That’s more than 10 GB of data!

The text file has a simple structure with one measurement value per row:
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
Hamburg;34.2
St. John's;15.2
Cracow;12.6
... etc. ...
The program should print out the min, mean, and max values per station, alphabetically ordered. The format that is expected varies slightly from language to language, but the following example shows the expected output for the first three stations:
Hamburg;12.0;23.1;34.2
Bulawayo;8.9;22.1;35.2
Palembang;38.8;39.9;41.0
Oh, and this input.txt is different for each submission since it’s generated on-demand. So no hard-coding the results!

Other Notes

no plugins, libraries, or DLLs - let’s use pure Xojo code only
if you use multiple CPU cores, limit of 8 cores
Input value ranges are as follows:
- Station name: non null UTF-8 string of min length 1 character and max length 100 bytes (i.e. this could be 100 one-byte characters, or 50 two-byte characters, etc.)
- Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit
There is a maximum of 10,000 unique station names (but note that the reference data only has 413 stations.)

Baseline

Here’s my Xojo project which can be a starting point:

It’s a console app
it’s purely single-threaded Xojo code
if the input.txt data file is missing, it creates it. Note: It only creates 10 000 000 (10 Million) rows for now for practical reasons.
it does a Baseline benchmark. This algorithm is how a normal person would solve the problem (using a TextInputStream, UTF8 Strings, Dictionaries, Arrays, String.ToDouble conversion, etc.)
it then does several Optimized benchmarks and compares results

On my Core i9 Intel Mac here are my results:

----------------------------------------
Xojo 1BRC (One Billion Row Challenge)
Xojo1BRC version 1.0.0.0 macOS Intel x64
----------------------------------------
----------------------------------------
Baseline.CreateMeasurements
Creating Data File
----------------------------------------
Creating 10000000 measurements for 413 stations
Data file = input.txt

Write row 10000000 (100%)Created 10000000 rows in 239.7 seconds, rate = 41721 rows/second
----------------------------------------
Baseline.Process
Reading row 10000000                                           

Processed 10000000 rows in 22.6 seconds, rate = 443388 rows/second
----------------------------------------
Optimization_md_1.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 17.9 seconds, rate = 560132 rows/second
Speedup : 1.3x faster than Baseline
----------------------------------------
Optimization_md_2.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 14.4 seconds, rate = 695441 rows/second
Speedup : 1.6x faster than Baseline
----------------------------------------
Optimization_md_3.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 10.2 seconds, rate = 978989 rows/second
Speedup : 2.2x faster than Baseline
----------------------------------------
Optimization_md_4.Process
                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 5.2 seconds, rate = 1909900 rows/second
Speedup : 4.3x faster than Baseline
----------------------------------------

The baseline does about 400k rows per second, but as of my 4th optimization, I’m getting just under 2 million rows per second per CPU core.

How fast is that?

The winner of the original 1BRC (written in Java) gets about 80 million rows per second per CPU core.

So we have about 40x speedup left to go!

Here’s the baseline version of the project.

By request, the link to this version of the project has been removed. Please work back from the end of this conversation to get the latest version.

It includes a module named Optimization_yourname_1 which you can modify to test out your own optimizations.

Martin_T · October 13, 2024, 7:21pm

This looks like a nice project worth playing with. Here is my initial version, which has already been optimized a little. It uses

Str() instead of .ToString
CLong() instead of .ToInteger
Format() instead of .ToString
Val() instead of .ToDouble
Variables outside of loops

Here are my results for macOS ARM x64:

----------------------------------------
Xojo 1BRC (One Billion Row Challenge)
Xojo1BRC version 1.0.0.0 macOS ARM x64
----------------------------------------
----------------------------------------
Baseline.CreateMeasurements
Creating Data File
----------------------------------------
Creating 10000000 measurements for 413 stations
Data file = input.txt

Write row 10000000 (100%)Created 10000000 rows in 11,5 seconds, rate = 872236 rows/second
----------------------------------------
Baseline.Process
Reading row 10000000                                           

Processed 10000000 rows in 8,8 seconds, rate = 1135738 rows/second
----------------------------------------
Optimization_yourname_1.Process
Reading row 10000000                                           
Calculated 10000000 rows from 413 stations
Processed 10000000 rows in 6,2 seconds, rate = 1610281 rows/second
Speedup : 1,4x faster than Baseline
----------------------------------------

Xojo1BRC_baseline.xojo_binary_project.zip (11.6 KB)

Mike_D · October 13, 2024, 10:07pm

Those are interesting optimizations - most of them are using the older API1 functions, right? I too feel like some of the API2 calls are slower than API1 versions (though they may have better correctness).

Also interesting that creation of the data file is over 20x faster on your M series mac. I know the CPUs and SSDs are fast on those things, but I wonder if there’s some other Xojo framework issue with Intel builds?

Martin_T · October 13, 2024, 10:20pm

The API 2.0 methods are extension methods that most likely use the older API 1.0 methods under the hood. This results in overhead, which presumably leads to the loss of speed.

Rick_Araujo · October 14, 2024, 6:07am

More like wrappers, a call to another call. Xojo must work to optimize and eliminate such extra steps. But I’m almost convinced that there are more things, like ToString using a modern and complex ICU processing and Format() a simpler one, because they behave differently.

Mike_Linacre1 · October 14, 2024, 8:18am

Quote: “calculates the min, mean, and max temperature per weather station.”

If that is all we need, then there is no need to save values. We could build an array of weather stations ordered for a binary search, then each station only needs a few numbers: max so far, min so far, count so far, mean so far. (Recommend using “incremental averaging” ).

Doing this, reading the file would be the slowest part.

Or have I missed something?

MarkusR · October 14, 2024, 10:59am

someone already used a in memory sqlite database for this task with xojo?

rules said:

You’re limited to the standard library of your language.

Ian_Kennedy · October 14, 2024, 1:38pm

Keep Min, Max, Sum and Count. Then you only need to perform divisions once at the end of the process when reporting.

Mike_D · October 14, 2024, 2:20pm

Good points.

Here are some hints that dovetail…

keep the temperatureSum and count, and update min and max as you go. No need to store an array of temperatures. Calculate the mean in the final step.
using TextInputString and Strings is slow. I moved to MemoryBlocks, then eventually to Ptr (Pointers) which are the fastest Xojo can offer for byte twiddling
String to Double conversion is slow. Most solutions on 1BRC.dev end up using a custom function to do this. Because you are guaranteed the number format is simple (-XX.X through XX.X, always with one decimal) you can optimize this step highly.
Xojo Dictionary is slow - internally it uses Variants and case-insensitive string comparisons. I ended up writing a custom Dictionary class which uses a fast integer hash.
@Mike_Linacre1 idea of using a Binary tree is a good one. Would it be faster than my custom Dictionary class? TBD

the official rules say to do 5 runs and average the middle 3 results. My Baseline project runs the baseline processing first, so you can assume the file is already cached by the OS by the time your optimized code runs. Most of the time therefore should be processing time, not disk read time.

Jeremie_L · October 14, 2024, 2:32pm

Baseline calculation is actually wrong.
The minimum temperature for each station will always be 0.0

Verifying it is wrong:
If you change the CreateMeasurements method on line 46 to this:

temp = max(1.0, stationMean + ( r.Gaussian * stdDev ) )

The minimal temperature for each station will still be 0.0 in the output, instead of 1.0

Mike_D · October 14, 2024, 2:49pm

@Jeremie_L is right - the bug was not setting Min and Max to very high or low values before starting. They are now set to 999 and -999 which fixes the calculation.

Here’s baseline_v2 with these changes:

minor cosmetic cleanup of the progress lines when running
fixed min/max temperature calculation bug
outputs temperatures with leading zeros, e.g. -0.7 instead of -.7
debug run defaults to 100000 rows for faster testing
for built app, locate the data files in CWD not one level up

Xojo1BRC_baseline_2.xojo_binary_project.zip (11.7 KB)

Kem_Tekinay · October 15, 2024, 12:48pm

Through much trial and error, I got this to about 1.9x baseline. This is not something I recommend!

I use a MemoryBlock and Ptr to parse the original data and dispense with encodings.

(Based off of one of the earlier projects, Martin’s I think.)

Xojo1BRC_baseline - Kem.xojo_binary_project.zip (12.6 KB)

Ulrich_Bogun · October 15, 2024, 3:44pm

Uh. With aggressive optimisation, that compiled almost 40 minutes on my MacBook Pro M1. I was afraid it would unroll 10 million repetitions, but with a size of “only” 18 MB it doesn’t look that way. 4.1 seconds here, factor 2.1.

Mike_D · October 15, 2024, 3:45pm

Testing notes:

The project has Build Optimization Level set to Moderate. @Kem_Tekinay reported, and I confirm that on Aggressive it never seems to compile [Edit: it takes an hour+ to compile, and runs more slowly] Kem reported this: https://tracker.xojo.com/xojoinc/xojo/-/issues/77604
A quick way to optimize is to run the built app, open ActivityMonitor.app, and take a sample of the process. You’ll get a stack trace sample with the methods taking the longest at the top, as in this example:

Sample

Call graph:
    2330 Thread_10814897   DispatchQueue_1: com.apple.main-thread  (serial)
    + 2330 start  (in dyld) + 1909  [0x7ff80e6c2345]
    +   2330 main  (in Xojo1BRC) + 19  [0x10219b803]
    +     2330 _Main  (in Xojo1BRC) + 458  [0x10219be1a]
    +       2330 REALbasic._RuntimeRun  (in Xojo1BRC) + 19  [0x1020ce273]
    +         2330 RuntimeRun  (in rbframework.dylib) + 53  [0x1035382df]
    +           2330 CallFunctionWithExceptionHandling(void (*)())  (in rbframework.dylib) + 134  [0x1035a01b1]
    +             2330 ConsoleApplication._CallFunctionWithExceptionHandling%%o<ConsoleApplication>p  (in Xojo1BRC) + 184  [0x1020a1888]
    +               2330 CallConsoleApplicationRunEventHelper()  (in rbframework.dylib) + 325  [0x1034c3db8]
    +                 2330 App.Event_Run%i8%o<App>A1s  (in Xojo1BRC) + 9630  [0x10213dece]
    +                   2330 Baseline.Process%%o<FolderItem>o<FolderItem>  (in Xojo1BRC) + 226  [0x102186ad2]
    +                     675 Baseline.ProcessIntoDictionary%o<Dictionary>%o<FolderItem>  (in Xojo1BRC) + 1924  [0x102182314]
    +                     ! 670 String.$ToDouble%f8%s  (in Xojo1BRC) + 22  [0x101ddeff6]
    +                     ! : 662 String.$CDbl%f8%s  (in Xojo1BRC) + 22  [0x101ddd796]
    +                     ! : | 305 runCDbl  (in rbframework.dylib) + 284  [0x10359c06c]
    +                     ! : | + 305 replaceAll(string, string, string)  (in rbframework.dylib) + 78  [0x10359a480]
    +                     ! : | +   75 StringOpsClassic::ReplaceAllText(StringStorageBase*, StringStorageBase*, StringStorageBase*)  (in rbframework.dylib) + 932  [0x10360ddde]
    +                     ! : | +   ! 34 ???  (in rbframework.dylib)  load address 0x1032f6000 + 0x3138d5  [0x1036098d5]

From this you can see the hottest method is in the String.ToDouble() method, which appears to have a lot of string processing inside it (including a call to ReplaceAllText which perhaps should be reported as a separate bug?)

Ulrich_Bogun · October 15, 2024, 3:49pm

Thank you! I missed that. So yes, as written in the ticket: It does compile. On an Intel Mac you probably have to wait more than an hour.

Kem_Tekinay · October 15, 2024, 3:49pm

Moderate is supposed to reduce file size, not improve speed. At least, that’s what (I thought) an engineer told me.

Kem_Tekinay · October 15, 2024, 4:10pm

The next step is to use preemptive threads via ThreadPool.

My thinking is that each Thread will get a chunk of input that it will parse and store in its own Dictionary. It will return that Dictionary to the main thread which will reconcile it with a “master” Dictionary. This will keep each Thread going without blocking other Threads, and I don’t think the reconciliation process would take that long. Given the limit of using 8 cores, I don’t think this will result in an 8x speedup, but hopefully a 2-4x improvement.

Ulrich_Bogun · October 15, 2024, 4:19pm

You probably pondered that idea already too: If there were just one dictionary and all threads would work into it (waiting just at the dictionary part if necessary), how are the chances of running into a critical section considered there are over 400 stations?

EDIT: Forget it. You don’t only have to stop them if two threads are working on the same station of course, but for every manipulation of the dictionary …

Mike_D · October 15, 2024, 4:31pm

@Kem_Tekinay took the bait, yay!

If you are interested at peeking at my solution, here’s my v4 (which includes my v1, v2, and v3) optimizations. It gets about 4x speedup, using the following techniques:

almost no String use (instead, stores Ptr, Offset, Length)
custom string->double conversion (again, using Ptr)
custom Dictionary class which implements an “open addressing, linear probing” algorithm (Link) combined with a fast Unit32 string hash algorithm and binary string comparison (again, using Ptr and bytes, not strings).

Xojo1BRC_md_4.xojo_binary_project.zip (18.0 KB)

Ideas for further optimization:

preemptive threading
since the temperature values are all X.X decimals, avoid all use of Doubles. Instead, store as an integer x10, and only divide by 10 when doing the final calculations.

Kem_Tekinay · October 15, 2024, 4:33pm

Right.

My other idea was to use a flag to pull or store a Station in the master Dictionary, then use a flag built into a cStation to manipulate it. Something like this:

DictSemaphore.Signal

station = MasterDict.Lookup( name, nil )

if station is nil then
  station = new cStation
  station.Semaphore = new Semaphore
  MasterDict.Value( name ) = station
end if

DictSemaphore.Release

station.Semaphore.Signal

// station updates

station.Semaphore.Release

Maybe I’ll go that route first.