I want to store a large number of small pieces of data in a single file. Ideally I’d use a SQLDatabase, but that’s not possible at this time.
So I was thinking I’d just write all this data to disk, with a header that acts as a lookup table of some sort, so the application doesn’t have to trawl through the data to find the matching key and return the value.
So I’m thinking of creating a hash table, but I’m not sure that’s very space efficient. Any other suggestions?
I have a feeling your requirements won’t allow this, but a simple JSON struct is probably easiest. Xojo.ParseJSON can handle huge quantities of JSON in milliseconds. Then you just have a regular Xojo dictionary to deal with.
Read Only (stored within the Resources folder of an application).
The data will increase over time as the application is updated, but the user cannot modify it.
Ideally I’d use a Xojo SQLDatabase, as that does what I want, but last time I checked these couldn’t work from the application’s Resources folder (i.e. I couldn’t find a way to open one as read-only). I don’t want to go down the rabbit hole of copying it to the user’s folder, as I’d have to choose between having one db for all versions of the application or potentially wasting the user’s disk space with a db per version of the product.
@Thom_McGrath Thanks Thom,
At this point I don’t really know how much data will be stored, and it may only be used once in the application’s lifecycle, so I am concerned that the advantage of the dictionary’s hash table for fast searching is nullified by having to load all that data and convert it to a dictionary in the first place.
@Kem_Tekinay Thanks Kem,
The contents are not really a secret; it’s basically a search index. And while I’d like to use SQLDatabase, last time I tried I couldn’t, because the Xojo SQLDatabase opens for read and write and would fail when stored within the Resources folder.
I also have concerns about copying the database to the user’s library.
The searches basically determine which file to load, so for the first version I could actually use the file system: if there’s a single match, use a symbolic link (as the data will be in a location relative to the index); if there are multiple matches, use a CSV file with the relative paths to the matching pages.
It’s a bit messy, but should (in theory) give me reasonable performance.
Edit: I understand in principle how hash tables work: convert the key to an index, then use that index to look up a pointer to the actual data. It’s the hashing of the key that I have trouble seeing working with a file on disk that isn’t massive and doesn’t waste space.
What will your key be? An Int, a few bytes, a string? In other words, can it be fixed size, or will it vary?
The idea I can sketch out here, because I know you can understand it and expand on it yourself, is creating a “resource compiler”: something that takes your inputs (records + keys) and builds two files, one data file and one index file.
Your data file will just have every record one after another; before writing each record, you take note of its offset and size, to be written into your index file.
You will create a temporary SQLite DB with all fields of fixed size: your key, the data offset in the data file, the data record size… As you build your data file, you add one entry here.
Once you’re done, SELECT the rows ordered by key and dump those fields into the index file. Every field in every record must have the same size in this index file.
Your indexed resource file is ready (data + index). Delete the temp SQLite DB.
Write a binary search algorithm that fetches one index record by the key field in the index file (I believe you know the concept; if not, ask me).
If found, you now have the offset and size of the record you want. Go to the data file, seek to that position, and read those n bytes. Use them as intended.
I hope I was clear enough.
OK, since I apparently missed the need to avoid reading everything into memory, finding the thing, and discarding the dictionary, what about this:
I assume you can define something unique to identify ‘the data’
(This is what I might have expected to be the key for a dictionary)
So why not just have a subfolder in Resources holding as many small files as you need, named for this key?
It seems the problem mostly stems from having lots of small packets of data in a single large file.