How to populate a dictionary as fast as possible from an array

Hi all,
I’m using an array of … 1 000 000 000 (yes, one billion!) rows of booleans. When full, I have to populate a dictionary with all rows containing a « true » value, with dictionary index as key and array index (row number) as value.
What would be the fastest method, other than iterating each row of the array and testing for true values?
Thanks for any help or advice.

Can you start with the Dictionary in the first place?

And, along those lines, create a class that acts like an array through Operator_Index that uses a Dictionary on the backend to track the True values. At the end, you can extract its Dictionary.

1 Like

Thanks Kem,
I get the idea but have no experience with Operator_. And I found nothing about Operator_Index in LR nor online documentation.
Would you be so kind as to send me an example or something as a reference?

My bad, it’s Operator_Subscript. Give me a minute…

A simple proof-of-concept project:

1 Like

Thanks a lot, Kem.
I will have a look at your project and let you know.

Curious how this turned out.

Hi Kem,
Again, thanks for having spend time on this.
Unfortunately, your code isn’t faster than mine.
Here is the problem:
Until 100 000 000 (100 millions) of rows in the array, populating a dictionary needs a raisonable time (from some ms to 2 minutes, depending on the number of rows).
But with 1 000 000 000 (one billion), time increases drastically, reaching 3 hours and more!!!
Increasing BinCount to 100 000 000, I got 2 hours, it’s my best result currently.

I also tried to populate directly the dictionary instead of an array, but:

  • It’s 10x slower,
  • Removing some unuseful entries in the dictionary endly puts entries in disorder, while I need to keep the dictionary in order.
    I will now see what happens when exporting the array first to a text file (or MemoryBlock) with join, and then populating the dictionary with NthField and pairs, but I’m quite sure it will be slower…

FYI, I have a Mac running Mac OS X 10.13.6, 16 GB of RAM, a SSD disk.
After populating a dictionary from a one billion rows array, RAM dedicated to the App reaches … 9GB !

Can you try in-memory sqliteDatabase ?

I guess the next question is, why do you need the Dictionary? Once you have the array, what purpose does it serve?

(This sounds like a challenge, but I’m genuinely trying to understand. There might be an alternate way to get to your goal.)

When full, I have to populate a dictionary with all rows containing a « true » value, with dictionary index as key and array index (row number) as value.

But do you have to populate a dictionary?
Do you even need the array?

The thing is, if this was stored in a database, you can store these values in a table, and retrieve them when you want.
The key is indexed, so retrieval is very fast.

No conversion between array and dictionary required.
Scalable.
Persistent

So you can use an array of 1 billion rows of booleans ? Then you are pretty close to the limit of an array i’d say:

All arrays are indexed starting at position 0. Arrays can have a maximum index value of 2,147,483,646.

Better use something that’s much more doable and high-speed or more efficient.

If booleans are stored as 64 bit integers internlly then each boolean requires 8 bytes minimum for the dictionary value. A dictionary also has the key, which is probably another 8 bytes minimum. Plus, a dictionary has overhead for the hash table, which is likely another 8 bytes minimum per entry. So for a billion items you are probably going to need at least 24GB of RAM.

1 Like

Ok, I see I have to give you more details.
Yes, Kem, you’re right, it’s a challenge! :grin:
The goal is to calculate Prime numbers, as fast and as much as possible.
I’ve coded a Sieve of Eratosthene, using an array of false Boolean, and I set up the boolean to True when finding a Prime. This is very fast, i.e.:

  • 1 000 000 rows give 78 498 Primes in 0,05333 to 0,56 second,
  • 10 000 000 rows → 664 579 Primes in 0,5977 à 0,6124 second
  • 100 000 000 rows → 5 761 455 Primes in 6,51 seconds and
  • 1 billion rows → 50 847 534 Primes in 66,11 seconds
    But then, searching a Prime in the array is very slow, because the Prime is not the value of a row, but the index. So I have to loop until finding the xth Prime, ignoring rows with False values.

I thought I should use a dictionary. This works fine, but between 100 million and 1 billion, this becomes drastically slow.

Now, is there a way to accelerate the search in the array ? Or another way to store Primes? I would prefer not to use DB, because I’m not familiar with SQL.

I didn’t try to work in a array with Prime as value, because I would have to store this value and I’m afraid it would be time and Memory consuming. Your opinion?

My best advice is to get familiar with sql, generate the database once, then distribute it with your app.

1 Like

Please show us your code for searching the array, I bet we can speed it up.

Or, what about creating an integer array that stores just the prime numbers? Finding the xth prime would be a matter of simple math from any starting point.

Or, Could you add the number to your dictionary at the point you set the number to true in the array?