I want to store a large number of small pieces of data in a single file. Ideally I’d use a SQLDatabase, but that’s not possible at this time.
So I was thinking I’d just write all this data to disk, with a header that acts as a lookup table of some sort, so the application doesn’t have to trawl through the data to find the matching key and return the value.
So I’m thinking of creating a hash table, but I’m not sure that’s very space efficient. Any other suggestions?
I have a feeling your requirements won’t allow this, but a simple JSON struct is probably easiest. Xojo.ParseJSON can handle huge quantities of JSON in milliseconds. Then you just have a regular Xojo dictionary to deal with.
Read Only (stored within the Resources folder of an application).
The data will increase over time as the application is updated, but the user cannot modify it.
Ideally I’d use a Xojo SQLDatabase, as that does what I want, but last time I checked these couldn’t work from the application’s Resources folder (i.e. I couldn’t find a way to open one as read-only). I don’t want to go down the rabbit hole of copying it to the user’s folder, as I’d have to choose between having one db for all versions of the application or potentially wasting the user’s disk space with a db per version of the product.
@Thom_McGrath Thanks Thom,
At this point I don’t really know how much data will be stored, and it may only be used once in the application’s lifecycle, so I am concerned that the advantage of the dictionary’s hash table for fast searching is nullified by having to load all that data and convert it to a dictionary in the first place.
@Kem_Tekinay Thanks Kem,
The contents are not really a secret; it’s basically a search index. And while I’d like to use SQLDatabase, last time I tried I couldn’t, because the Xojo SQLDatabase opens for read and write and would fail when stored within the Resources folder.
I also have concerns about copying the database to the user’s library.
The searches basically determine which file to load, so for the first version I could actually use the file system: if there’s a single match, use a symbolic link (as the data will be in a location relative to the index); if there are multiple matches, use a CSV file with the relative paths to the matching pages.
It’s a bit messy, but should (in theory) give me reasonable performance.
Edit: I understand in principle how hash tables work: convert the key to an index, then use that index to look up a pointer to the actual data. It’s the hashing of the key that I have trouble seeing working with a file on disk that isn’t massive and doesn’t waste space.
What will your key be? An Int, a few bytes, a string? In other words, can it be fixed size, or will it vary?
The idea I can sketch out here, because I know you can understand it and expand on it yourself, is creating a “resource compiler”: something that takes your inputs (records + keys) and builds two files, one data file and one index file.
Your data file will just have every record one after another; before writing each record, you take note of its offset and size, to be written into your index file.
You will create a temporary SQLite DB with all fields of fixed size: your key, the data offset in the data file, the data record size… As you build your data file, you add one entry here.
Once you’re done, SELECT the rows ordered by key and dump those fields into the index file. Every field in every record must have the same size in this index file.
Your indexed resource file is ready (data + index). Delete the temp SQLite DB.
Write a binary search algorithm that fetches one index record by the key field in the index file (I believe you know the concept; if not, ask me).
If found, you now have the offset and size of the record you want. Go to the data file, seek to that position, and read those n bytes. Use them as intended.
I hope I was clear enough.
OK, since I apparently missed the need to avoid reading everything into memory, finding the thing, and discarding the dictionary, what about this:
I assume you can define something unique to identify ‘the data’
(This is what I might have expected to be the key for a dictionary)
So why not just have a subfolder in Resources holding as many small files as you need, named for this key?
It seems the problem mostly stems from having lots of small packets of data in a single large file.