Using a hash table on a file?

Thank you @Rick_Araujo for your explanation.

The keys and data are strings of variable lengths.

In what limited time I had yesterday, I devised the following strategy, which seems to be similar to what I understand from your suggestion.

The datafile consists of the following sections (sketched in code after the list).

  • Header, including values to convert UniqueID into an index.
  • LOOK table, a series of Ptrs (basically a LUT)
  • DATA table, where the keys and data are actually stored.
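
To make the layout concrete, here is a rough sketch of reading such a header in Xojo. Every field name, width, and the file path below is an assumption for illustration; the real header is whatever the format defines.

```
' Hypothetical header layout: numberOfKeys, ptrSize, then the two table offsets.
Var dataFile As FolderItem = SpecialFolder.ApplicationData.Child("help.index") ' illustrative path
Var bs As BinaryStream = BinaryStream.Open(dataFile, False) ' open read-only
bs.LittleEndian = True
Var numberOfKeys As UInt32 = bs.ReadUInt32 ' used to scale a hash into a slot
Var ptrSize As UInt32 = bs.ReadUInt32      ' width of one entry in the LOOK table
Var lookOffset As UInt64 = bs.ReadUInt64   ' file position of the LOOK table
Var dataOffset As UInt64 = bs.ReadUInt64   ' file position of the DATA table
```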

The process should flow… (ha!)

  1. Convert the key into a unique number by using a custom hash.
  2. Convert that number into an index by scaling it to fit between 0 and numberOfKeys, multiplying it by ptrSize, and adding the LOOK table offset.
  3. Use the index to go straight to and read the dataPtr from the LOOK table.
  4. Use the dataPtr to go straight to the correct position in the DATA table and read the data.

There will be collisions, so colliding entries are stored in arrays along with their keys, which means the lookup can step through to the correct key (see the sketch below).
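
Here is a minimal sketch of steps 1–4 plus the collision walk, in Xojo. The record layout (an entry count followed by length-prefixed key/value pairs) and the use of Mod for the scaling step are assumptions, not the actual format; CustomHash is sketched a little further below.

```
Function LookupValue(bs As BinaryStream, key As String, numberOfKeys As UInt32, ptrSize As UInt32, lookOffset As UInt64) As String
  Var h As UInt64 = CustomHash(key)            ' step 1: hash the key
  Var slot As UInt64 = h Mod numberOfKeys      ' step 2: scale into an index...
  bs.Position = lookOffset + (slot * ptrSize)  ' ...then into a LOOK table offset
  Var dataPtr As UInt64 = bs.ReadUInt64        ' step 3: read the pointer
  bs.Position = dataPtr                        ' step 4: jump into the DATA table
  Var entryCount As UInt32 = bs.ReadUInt32     ' assumed: count of colliding entries
  For i As Integer = 1 To entryCount
    Var keyLen As UInt32 = bs.ReadUInt32       ' assumed: length-prefixed strings
    Var storedKey As String = bs.Read(keyLen)
    Var valueLen As UInt32 = bs.ReadUInt32
    Var value As String = bs.Read(valueLen)
    If storedKey = key Then Return value       ' step through collisions to the right key
  Next
  Return "" ' key not present
End Function
```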

It is far from perfect, but it should give me what I want. I can see the custom hash is probably where most of the time will be spent in getting it right. I need it to be quick, and at the same time produce as few collisions as possible.
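
For what it’s worth, FNV-1a is a common choice when you want a fast string hash with a reasonable collision rate, and a Xojo version is tiny. This is only a suggestion, not the hash the post ends up using.

```
Function CustomHash(key As String) As UInt64
  ' FNV-1a over the key's bytes; UInt64 math wraps, which is what FNV expects.
  Var bytes As MemoryBlock = key ' implicit conversion to the string's bytes
  Var h As UInt64 = &hCBF29CE484222325 ' FNV-1a 64-bit offset basis
  For i As Integer = 0 To bytes.Size - 1
    h = h Xor bytes.UInt8Value(i)
    h = h * &h00000100000001B3 ' FNV-1a 64-bit prime
  Next
  Return h
End Function
```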

The hardest part is what do I call this?
Ohanaware Disk Indentured Dictionary?

That is a potential solution, especially for the moment, as the end result is knowing which file to load.

I tried to create the simplest possible version, avoiding extra data structures and hashes for a fast implementation, and also avoiding storing extra data. But with variable-size keys, a hash-distributed index over those keys is one way to go; then again, maybe a B+tree could be better?

About the name, your call. But I dislike the words “Disk” and “Indentured”. :smiley: I prefer combinations of stored, read-only, frozen, external, etc.

Another idea, if a SQLite database wouldn’t be too big: on startup, copy the database into a uniquely named temp file; on quit, delete it. If your app crashes, the temp file will linger, but only until the next restart.
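
A rough Xojo sketch of that idea; file names and locations are illustrative assumptions.

```
Var source As FolderItem = SpecialFolder.ApplicationData.Child("help.sqlite") ' assumed location
Var stamp As Integer = DateTime.Now.SecondsFrom1970 ' coarse uniqueness, one copy per launch
Var temp As FolderItem = SpecialFolder.Temporary.Child("help-" + stamp.ToString + ".sqlite")
source.CopyTo(temp) ' raises IOException on failure

Var db As New SQLiteDatabase
db.DatabaseFile = temp
db.Connect ' raises DatabaseException on failure

' ... use db for the session; on quit:
db.Close
temp.Remove ' skipped after a crash, so the copy lingers until the next launch
```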

It’s working and I got some numbers. We all like numbers, right?

With a limited data set, it’s taking about 80 microseconds to locate and load the data on an i9 2019 16" MacBook Pro. This is well within the acceptable range for this phase.

In comparison, a Xojo dictionary takes about 11 ms to look up and return the data. However, with a Xojo dictionary you need to have already de-serialized the data and loaded it into the dictionary.
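
For anyone who wants to reproduce that kind of number, a minimal timing harness could look like this; the key and value are illustrative, and averaging over many iterations keeps the figure meaningful.

```
Var d As New Dictionary
d.Value("window.shadow") = "serialized help data" ' illustrative entry

Const kIterations = 10000
Var start As Double = System.Microseconds
For i As Integer = 1 To kIterations
  Var v As String = d.Value("window.shadow")
  #Pragma Unused v
Next
Var average As Double = (System.Microseconds - start) / kIterations
System.DebugLog("Average lookup: " + average.ToString + " microseconds")
```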

In theory (providing my algorithm holds up), the search speed shouldn’t get slower as more data is added to the file (or, as @Jeff_Tullin has pointed out, database).

Edit: For phase 2 of the testing, I imported a JSON file containing 105 keys (and their values), which is a live data set of what this application will be working with. Loading the JSON file took 1,579 ms.

Once converted to this format (and reading from disk), my format took around 80 ms to find the record, and 110 ms to de-serialize the data.

So obviously some work is needed to improve de-serialization time.

As the same data is used in multiple places, I am considering adding a third table; it requires a third hop, but allows the same data to be used for multiple results (see the sketch below).
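
Continuing the variables from the earlier lookup sketch, the extra indirection might read like this; the record layout is again an assumption.

```
' bs, lookOffset, slot and ptrSize are as in the lookup sketch above.
bs.Position = lookOffset + (slot * ptrSize)
Var keyPtr As UInt64 = bs.ReadUInt64   ' hop 1: LOOK entry -> KEY record
bs.Position = keyPtr
Var keyLen As UInt32 = bs.ReadUInt32
Var storedKey As String = bs.Read(keyLen)
Var dataPtr As UInt64 = bs.ReadUInt64  ' hop 2: KEY record -> shared DATA record
bs.Position = dataPtr                  ' hop 3: read the value shared by many keys
Var valueLen As UInt32 = bs.ReadUInt32
Var value As String = bs.Read(valueLen)
```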

If anyone is interested in seeing this in action… It is pretty boring…

Download App Wrapper 4.4 Alpha 3 from App Wrapper 4 - Prereleases. Open the Help, and use the search field in the help toolbar.

For rapid data searching with little loss of performance, possibly a Trie? (Data/GraphQL)?
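
If anyone wants to experiment with the trie idea in plain Xojo (separate from the link below), a minimal in-memory version needs only a small node class; everything here is an illustrative sketch.

```
' TrieNode: a class created in the IDE with two properties:
'   Children As Dictionary  ' child nodes keyed by a single character
'   Value As String         ' payload; non-empty when a key ends here

Sub TrieInsert(root As TrieNode, key As String, value As String)
  Var node As TrieNode = root
  For i As Integer = 0 To key.Length - 1
    Var ch As String = key.Middle(i, 1)
    If node.Children Is Nil Then node.Children = New Dictionary
    If Not node.Children.HasKey(ch) Then node.Children.Value(ch) = New TrieNode
    node = node.Children.Value(ch)
  Next
  node.Value = value
End Sub

Function TrieFind(root As TrieNode, key As String) As String
  Var node As TrieNode = root
  For i As Integer = 0 To key.Length - 1
    Var ch As String = key.Middle(i, 1)
    If node.Children Is Nil Or Not node.Children.HasKey(ch) Then Return ""
    node = node.Children.Value(ch)
  Next
  Return node.Value
End Function
```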

See “USE” section:

These types of database structures are fun to design, and very rewarding when you are used to relational databases. Very complex databases can be structured and used concurrently (both reading and writing) without issue and without using joins; plus they’re much easier to scale, and even to make decentralized/distributed.

Check it out :slight_smile: