optimization, array vs memory block access?

This is one of those things that really won’t make that big a difference probably and I’m solving a problem that doesn’t really exist since it works perfectly fine with arrays, but I’m looking at optimizations now that the base code is working very reliably.

I’ve got a “database” which is just a disk file of 2 8 byte records, a sequence of dates and values. Queries are only ever by the date entry (totalseconds stored as a double) They are always in order numerically by date so using an SQL database is huge overkill and for years now I’ve just been managing it by making sure that you don’t write date values out of order and keeping an index that gives the offset into the file by date with a new index point every 500 entries in the file. Then I walk the array of index points to find the closest index into the file and then walk the actual data to find the specific date i’m looking for to start the data read.

At the moment there is no caching of the actual data in memory or any such thing, only the index is stored in memory and all reading of the actual data is done from the disk file. As peoples data has grown though there are starting to be thousands of index points too so there is room for cleaning things up there.

Right now I load the index into an array of doubles and just walk it to find the nearest date value. The most obvious performance increase for both the index searching and finding the actual value on disk is to not walk them in order but to start in the middle of the range and compare and then divide the half that it’s within in half and so forth until I find it. I’m not sure what kind of search that is called… and I haven’t implemented such a thing in a very long time but I’m going to do that expecting that will give me the greatest increase in speed.

However it occurs to me that it might be faster to simply hold the index points in a single memory block rather than an array. Has anyone already done testing of array access vs memory block access? Would it be significantly faster to access the memory block than it would be the array? I will run my own tests after I’ve got the search working, but would love to know what others have discovered.

I’m also slightly concerned about using that search algorithm against the data on disk as I would be bouncing the .position of the file around to do that and I’m not sure what that will do to the build in file caching. I plan to start some kind of internal caching of the actual data as well but thats more complicated given how it’s accessed elsewhere. The low hanging fruit of first optimization is to fix the searches to run faster.

Binary chop I think…
Only works if the thing you are searching is ordered in advance.
Typically reduces the search to single-digit reads (ie less than 10 goes to find the closest record)

Is there only one record per date?
If so, and you want to avoid a database , then you could put the whole file into a dictionary and use Xojo’s built in algorithms.
Probably by reading each record in as a class on the way


[code]Dim d As New Dictionary
d.Value(thedate) = therecord
d.Value(anotherdate) = anotherrecord

MsgBox d.value(somedate)

But I do recommend that you switch to a database.
It will be relatively painless, totally upscaleable, and doing a search by date will be very fast anyway, if you index by date.
Faster again if the database is in-memory


A binary search, is how I’ve heard it. And they are very fast.

I have to agree with Jeff that what you’re doing practically screams “database”. However I have used MemoryBlocks over arrays with great effect as long as I’ve used Ptr to access to the data. A Dictionary might be the fastest though, and certainly faster than a binary search (I tested that a few years ago.)

But really the question is whether you can load all the values into memory. If the answer is “yes”, I’d store them all in a Dictionary, perhaps with a record class as Jeff suggested. If no, then I’d use an on-disk SQLite database.

Binary searches on disk require hits to the disk to figure out whether your current value is > or < the one your looking for
On an SSD this wont be an issue but on an HD this is orders of magnitude slower

I’d suggest

  1. cache the whole thing in memory and get rid of the hits to disk
  2. a database

Not sure this is the best solution, but an option is to create an in-memory SQLite database, if you do not wish to go to a full database solution.

The index I do cache in memory since thats relatively small, but the data has to not live entirely in the computer. I added some logging to the routines to see how much the different parts take, though I have this on a fairly recent mini with a very nice hard drive in it the maximum it’s taking for any index search function is 7 milliseconds. So thats really not all that bad. Thats on a file with over 20 million records in it and 41k index nodes with a full data file size of 320 meg. I’m not sure I could do that well with an SQL database without a lot of tweaking and even less good memory management. Though I wouldn’t be the one having to worry about the memory management :wink: Most smaller files don’t even measure a single millisecond to find the proper spot to start reading the data. I’m going to concentrate on a disk caching scheme first as thats going to be a better impact for the work. i’ll have to abstract away the stream so that I can load in chunks rather than just read directly from disk there. I think I can do something helpful with that.

May I suggest that you create a database first and benchmark against that?

yes, that is reasonable, since really, I have no idea of the performance of such a thing compared to this. I need to get over my Oracle PLSql baggage from an old job and embrace the more recent technology :wink: i actually removed the database from this program 5 or 6 years ago in preference of a flat file since it was adding significant overhead and the pipe for getting the data back in was slow for some reason. A lot has changed in that time, I should experiment.

My original actual question though remains, is it faster to walk an array comparing the values? Or to treat a memory block like a stream and read through it looking for the appropriate value?

try a binary search on an array ?
if they’re ordered should be darned fast

but seriously … oracle ? slow ?
yes it could be set up to be slow
i’ve worked on some oracle installs that would blow yer mind with how fast they are even though they are many many terabytes in size

Could have been slow if the method was to walk through the records instead of using select where I guess

Never sort or loop through recordsets, let the database do these jobs for you. Converting recordsets into arrays and sorting them sounds quite good to me. Bad data-design and normalization cannot be fixed by any software.

Any SQL database would be orders of magnitude faster than that. Not to mention much simpler to code.