Opinion on dictionary vs array

Hi everyone!

I’m looking at updating a CRM and order-management application that was designed for N orders years ago (perhaps ten) but is now handling 50 × N or more orders - so we saw lots of performance issues during 2021, etc.

These “orders” are physically produced, but within the context of the application they are custom objects. Currently these objects are loaded from the database engines into a massive array; sometimes as many as a million order objects are loaded into memory. A single array holds all of them, and other “customer” objects reference back into this array (so a customer object can easily reference that customer’s orders). However, for any operation that requires searching through this array, the time is slowly increasing. Additionally, the application updates objects in memory in real time (or as close to it as possible) with new orders, order status changes, etc., which can arrive at a rate of a hundred per second, so constantly searching this array to locate an order and apply a change to it adds up.
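To make the cost concrete, here is a minimal sketch of the lookup pattern described above (the Order class and its DatabaseIdentity property are hypothetical names used only for illustration); every lookup walks the whole array, so the cost grows with the order count:

Function FindOrderByIdentity(orders() As Order, identity As Int64) As Order
  // Linear scan: paid in full for every status change that arrives
  For Each o As Order In orders
    If o.DatabaseIdentity = identity Then
      Return o
    End If
  Next
  Return Nil // not found
End Function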

I was hoping to ping the community for opinions on speeding up access to these orders in memory, as the projection is at least another 25%+ increase in order volume for 2022.

My first thought was to load the order objects into a dictionary, with the key being the database identity of the order. It seems that order lookup in the dictionary would then be much faster, as even with a million items the dictionary could locate the one with database identity 123456789 via HasKey(identity) and Value(identity). The customer object would reference its orders either in an array or in the same type of dictionary, just containing references back to the main order dictionary.
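For concreteness, a minimal sketch of that idea (the variable and property names are hypothetical):

// Build the dictionary once, when the orders are loaded
Var ordersByIdentity As New Dictionary
For Each o As Order In orders
  ordersByIdentity.Value(o.DatabaseIdentity) = o
Next

// Later, locating an order is a single hashed lookup instead of an array scan
Var identity As Int64 = 123456789
If ordersByIdentity.HasKey(identity) Then
  Var theOrder As Order = ordersByIdentity.Value(identity)
  // apply the status change, etc.
End If

Lookup(identity, Nil) would do the same thing in one call instead of the HasKey/Value pair.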

However, another hitch is that in various places throughout the code base, orders are searched for not only by the db identity but also by other attributes, such as the internal order ID string plus version integer, or even the customer’s own order ID. (For example, an order may be for customer 012345 with database identity 123456789, order ID R789654 ver 2, while the customer’s own ID for the order is ABCDEFG.)

Would it be wise to have multiple concurrent dictionaries with references to the primary one? For example, a dictionary whose key is the db identity (as noted above), and a second dictionary whose key is the orderid-ver, with a reference back to the original dictionary?
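Sketched out, the multi-index idea might look like this (property names are hypothetical); each dictionary can simply hold a reference to the same Order object, so there is still only one copy of each order in memory:

Var ordersByIdentity As New Dictionary    // key: database identity
Var ordersByOrderIDVer As New Dictionary  // key: "R789654-2" style composite string
Var ordersByCustOrderID As New Dictionary // key: the customer's own order id

For Each o As Order In orders
  ordersByIdentity.Value(o.DatabaseIdentity) = o
  ordersByOrderIDVer.Value(o.OrderID + "-" + Str(o.Version)) = o
  ordersByCustOrderID.Value(o.CustomerOrderID) = o
Next

// Any of the keys now resolves to the same object:
Var found As Order = ordersByOrderIDVer.Lookup("R789654-2", Nil)

Whether the secondary dictionaries point at the Order objects directly or back at the primary dictionary is mostly a style choice; either way, the real cost is that every add/remove of an order has to update all of the dictionaries together.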

As the goal is to simplify and speed up operation, and moving away from the current array-based objects would require a lot of code changes, I want to be sure I am on the right path before committing the time to make these changes. So I am hopeful the community can offer opinions on the things I outlined above. Thank you in advance for any help!

Does your search operation really require all of these orders? Can you not split them into old, new, open, closed, or by year?

Have you considered an in-memory SQLite database?

2 Likes

“Does your search operation really require all of these orders? Can you not split them into old, new, open, closed, or by year?”

Correct - the application has user settings to load “x” number of days of orders into memory; even set to 30 days, that is still a lot (November 2021 would be roughly 200,000 orders).

In terms of further breaking down orders, there are currently 162 different statuses an order can have (anywhere from ‘received’, ‘print’, ‘cs issues’, ‘staging’, ‘packaged’, ‘shipped’, ‘charged’, etc.), in addition to over 100 more properties (though only a few are ‘searchable’). Within the confines of the application, searching goes hand in hand with heavy filtering, so that a user in any of multiple departments can view orders via parameters they enter (or via pre-saved filters), combined with a fuzzy search in a search bar. For example, a department could be viewing a filter of the orders relevant to them at the time (say, orders due tomorrow in certain statuses), which narrows it down to 2,500 orders; if they then type “ABC” in the search bar, only orders whose ID, customer order ID, event name, or one of many other properties matches that string (as a regex) are shown.
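That second stage only has to touch the already-filtered subset, so a straightforward pass with Xojo's RegEx class stays cheap. A sketch, with made-up property names and assuming the filter step has already produced a small filteredOrders array (the dictionary idea would help the identity lookups, not this part):

Var rx As New RegEx
rx.SearchPattern = "ABC"
rx.Options.CaseSensitive = False

Var hits() As Order
For Each o As Order In filteredOrders // the ~2,500 orders the filter produced
  If rx.Search(o.OrderID) <> Nil Or _
     rx.Search(o.CustomerOrderID) <> Nil Or _
     rx.Search(o.EventName) <> Nil Then
    hits.Add(o)
  End If
Next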

1 Like

Have you considered an in-memory SQLite database?

I have not; the application grew from a relatively light front end for a MS SQL Server database into its own beast of order and customer management. Would you recommend an in-memory local instance of a db over using dictionaries as a replacement for the massive arrays of objects?
I’m having trouble picturing how such a replacement could work, as the order classes have over 100 properties apiece and at least as many methods.

Why not leave the data in the database and request it as necessary? An in-memory SQLite database would remove the disk I/O that I think you’re avoiding, but with SSD drives the difference would be minimal.

2 Likes

I think that would be too slow - the application has been reworked over the last few years to make fewer database hits because the MS SQL Server (while sitting on a 16-core, 64 GB VM) simply can’t handle the quantity of requests coming from all of the workstations in a speedy manner, especially during the end-of-day crunch times as thousands of orders are being prepped for USPS and UPS pickups. While the VM infrastructure is getting an update later this year, I’m working toward speed optimizations at the local application level to be done in the next couple of weeks.

With that in mind, the one metric I’ve noted for improvement has been parsing through these large arrays - with fuzzy searching being only one part of the application’s operations. I just didn’t want to start updating the code base to use dictionaries if it would be a fool’s errand.

In terms of the SQLite idea, would you recommend that upon launch the application simply copies the data it needs from the SQL Server, performs lookups in memory, and makes occasional hits to the SQL Server for newer changes?
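Something along those lines could stay fairly small if it only mirrors the searchable columns. A rough sketch (API 2.0; the table and column names are made up, and older versions of Xojo return a Boolean from Connect instead of raising DatabaseException):

Var mem As New SQLiteDatabase // no DatabaseFile assigned = in-memory database
Try
  mem.Connect
  mem.ExecuteSQL("CREATE TABLE orders (identity INTEGER PRIMARY KEY, " _
    + "order_id TEXT, ver INTEGER, cust_order_id TEXT, status TEXT)")
  mem.ExecuteSQL("CREATE INDEX idx_orders_id_ver ON orders(order_id, ver)")
  mem.ExecuteSQL("CREATE INDEX idx_orders_cust ON orders(cust_order_id)")

  // Copy just the searchable columns over from the SQL Server data,
  // then refresh periodically or when change notifications arrive
  For Each o As Order In orders
    mem.ExecuteSQL("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", _
      o.DatabaseIdentity, o.OrderID, o.Version, o.CustomerOrderID, o.Status)
  Next

  // Any of the search keys is now an indexed query:
  Var rs As RowSet = mem.SelectSQL("SELECT identity FROM orders WHERE cust_order_id = ?", "ABCDEFG")
Catch e As DatabaseException
  // handle / log the error
End Try

The identities coming back from a query like that could then be used to pull the full Order objects out of a dictionary, so the 100+ property classes themselves would not need to change.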

Typically, in any large user-centric service there is not just one database; rather, there are reading databases and writing databases, all of which sync to stay consistent across databases/nodes (just about any large website/service you use online follows this methodology) - do some searching for “database cloning & synchronization.” If you have users reading and writing to and from the same database simultaneously, there will be ‘hang time.’ The larger the database, the longer queries will take as well (see also the classic N+1 query problem).

You can also ‘paginate’ data requests to reduce the number of entries being returned and speed up queries tremendously, since a user cannot view a page with a million orders on it (that would be silly); let the user go page by page through the orders, doing what is necessary. Simply count the orders so the user knows how many there are (and so you know how many pages the pagination will involve, with options for next page / previous page / go to page), then limit the query to return, say, the first 300 orders on page 1; page 2 will show 301 through 600, and so on. The pages are also more ‘print-friendly’ if needed (printing a few pages vs. printing 10,000 pages).
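As a sketch of what that could look like against the existing SQL Server (OFFSET/FETCH needs SQL Server 2012 or later; db is assumed to be the already-connected database, the table/column names are illustrative, and the placeholder syntax depends on the database plugin in use):

Var pageSize As Integer = 300
Var page As Integer = 1 // 1-based page number

// Total count so the UI can show how many pages exist
Var total As Integer = db.SelectSQL("SELECT COUNT(*) AS c FROM orders").Column("c").IntegerValue

// Only one page of rows comes back per request
Var rs As RowSet = db.SelectSQL( _
  "SELECT identity, order_id, status FROM orders ORDER BY identity " _
  + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY", _
  (page - 1) * pageSize, pageSize)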

I recommend creating a new project with a single simple screen and two buttons: one button would run the array approach, the other the dictionary approach.

With the database available, doing this is easy, and it allows you to measure the time each one takes.

You can even reduce the time further by changing the techniques used to perform the queries and depending less on an array or dictionary; MS SQL Server allows you to do that.
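A minimal timing harness for that test project might look like this (names are hypothetical; put the array version in one button's Pressed event and the dictionary version in the other, against the same test data):

Var started As Double = System.Microseconds

For i As Integer = 1 To 100000
  // Array version: linear scan for one identity
  For Each o As Order In orders
    If o.DatabaseIdentity = 123456789 Then
      Exit For
    End If
  Next
Next

Var elapsedMs As Double = (System.Microseconds - started) / 1000
MessageBox("100,000 lookups took " + elapsedMs.ToString + " ms")

// The second button would time the dictionary form of the same lookup:
// Var theOrder As Order = ordersByIdentity.Lookup(123456789, Nil)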

1 Like

Thank you; this is close to my original request for opinions on array vs. dictionary speed when the application needs to find the relevant objects.

I suppose I will try this with the idea of multiple dictionaries referencing the same underlying objects.

Take a good long look at the kinds of searches your users are doing and make sure your database is properly indexed to handle them. The database service is optimized for indexed queries. Try to avoid full database searches at all costs.

Then properly paginate the query results. Never return the whole result set at once. For one thing, nobody’s ever going to scroll through that many records anyway.

Present the user with a list of results and only return the fields being displayed. Read the entire field set only when they select a record to interact with.

Instead of trying to optimize with dictionary vs. array, optimize the way you use the database. The database makers are a lot smarter than you in terms of optimization. Capitalize on their experience.
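To illustrate that advice (db is assumed to be the connected database, and all table, column, and index names are invented): an index on the server shaped to a common search, a narrow query for the list view, and the full row only when a record is opened.

// Run once on the SQL Server, matched to the searches users actually do, e.g.:
//   CREATE INDEX IX_Orders_CustOrderID ON Orders (CustOrderID) INCLUDE (OrderID, Status)

// List view: only the displayed columns, and only the matching rows
Var rs As RowSet = db.SelectSQL( _
  "SELECT DbIdentity, OrderID, Status FROM Orders WHERE CustOrderID = ?", "ABCDEFG")

// Detail view: the full 100+ column row, fetched only when a record is selected
Var full As RowSet = db.SelectSQL("SELECT * FROM Orders WHERE DbIdentity = ?", 123456789)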

6 Likes

Your original post and the follow-ups gave me the shivers.

Your application’s data storage is absolutely BEGGING to be reimplemented as a database. Everything you’ve described – searching, indexing, making changes, etc. – are the primary reasons we use databases. They are optimized to make those functions fast.

I accept that your SQL situation isn’t ideal; this could be due to a litany of issues both in configuration and usage pattern. However, it’s unlikely that your use case is approaching the limits of any database server.

My suggestion: stop working on the object optimization and start working on a performant database solution. Hire someone in to consult if necessary, but you’ve got to start the process of moving away from what you’ve got now. You might be able to start your conversion by having your server app do all of its data operations on a locally-stored SQLite database (not one stored in RAM); once you have a reasonable database server available, it should be a lot easier to then switch your database calls over to it.

My sense of it is that you’re sitting on a ticking time bomb here, by mirroring data in memory like you are doing. You didn’t specify how changes to the objects eventually end up in permanent storage, but it sure seems to me like you’re one minor hardware hiccup away from a huge tangle of corrupted and partially-saved data. As one smart developer once told me when I was trying to do too much in code, “let databases do what they’re good at: searching and ordering large data sets”. :slight_smile:

5 Likes

Why the virtual infrastructure? The virtualisation layer is costing significant CPU cycles and IOPS. If there are any other guests on the virtual host, they will be contending for disk I/O and hammering the SQL Server’s performance. If there are no other guests, you don’t need the virtualisation layer. Getting the server onto bare metal is pretty much the first thing you should do.

The next thing would be to look at the db schema and the potential to offload searching and filtering into stored procedures.
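Purely as an illustration of that idea (the procedure, its parameters, and the exact call syntax from Xojo are all assumptions; real filters would mirror the department views described earlier):

// Created once on the SQL Server, e.g.:
//   CREATE PROCEDURE dbo.SearchOrders @DueDate date, @Search nvarchar(50) AS
//     SELECT DbIdentity, OrderID, Status
//     FROM Orders
//     WHERE DueDate = @DueDate
//       AND (OrderID LIKE '%' + @Search + '%' OR CustOrderID LIKE '%' + @Search + '%')

// The application then ships only two parameters and gets back a small,
// already-filtered result set (placeholder syntax depends on the plugin in use):
Var rs As RowSet = db.SelectSQL("EXEC dbo.SearchOrders ?, ?", "2022-01-15", "ABC")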

1 Like

Virtualized infrastructure is pretty standard practice for large scale operations.

And it is done on a hypervisor without a big OS layer underneath, which keeps the overhead extremely small. With vSphere, for example, the situation is definitely not comparable to virtual machines that you run on top of your OS with Parallels or the like.

The reasons large-scale operations are usually virtualized are better management, failure handling, and backup capabilities.

1 Like

I think you need at least a higher-level grouping field,
for example:
list of years ← list of customers ← list of orders
With something like a search tree, you can find data faster.
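A minimal sketch of that grouping idea with nested dictionaries (property names are made up), so a search only walks one customer's slice instead of the whole array:

Var ordersByCustomer As New Dictionary // key: customer id, value: Dictionary of that customer's orders

For Each o As Order In orders
  If Not ordersByCustomer.HasKey(o.CustomerID) Then
    ordersByCustomer.Value(o.CustomerID) = New Dictionary
  End If
  Var custOrders As Dictionary = ordersByCustomer.Value(o.CustomerID)
  custOrders.Value(o.DatabaseIdentity) = o
Next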

An in-memory SQLite db sounds great, but it has to be filled too,
and that causes a lot of changes.

Generally, for in-memory database servers there is special RAM that holds its data across a power interruption, in case you do not have an uninterruptible power supply.

Yes, I spend more time in server rooms than writing Xojo applications.

Server virtualisation is what you do when you have headroom to spare. When there is no headroom, virtualisation becomes the cause of non-linear slow-downs and ‘interesting’ failures. The vSphere and MSSQL knowledge bases are peppered with articles about the pitfalls of running a busy DBMS on a single ESXi or Hyper-V host. (Wonder how I know that.)

2 Likes