The following SQLite Update takes over 10 minutes at 100% CPU. The Devices table has 1,255 rows and the Connections table has 220,000 rows. The goal is to update the Devices table with the IPAddress of the most recent connection to that Device.
Sql = "UPDATE Devices SET IPAddress = (SELECT IPAddress FROM Connections WHERE Devices.DeviceID = Connections.DeviceID AND Connections.BeginTimeStamp = (SELECT MAX(Connections.BeginTimeStamp) FROM Connections WHERE Devices.DeviceID = Connections.DeviceID ))"
I’m wondering if better Indexes or a different query style can speed this up. Here’s how those tables are created:
CREATE TABLE Devices (DeviceID TEXT, DeviceName TEXT, IPAddress TEXT, LatestSysVersion TEXT, LatestAppVersion TEXT, PRIMARY KEY(DeviceID));
CREATE TABLE Connections (DeviceID TEXT, IPAddress TEXT, IPPort TEXT, BeginTimestamp TEXT, PRIMARY KEY(DeviceID, IPAddress, IPPort));
An index on (Connections.DeviceID, Connections.BeginTimeStamp) should help considerably.
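If it helps, here's a minimal sketch of adding that index in SQLite (shown via Python's sqlite3 for illustration; the index name `idx_conn_device_ts` is made up):

```python
import sqlite3

# Throwaway in-memory database; schema matches the Connections table from the thread.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Connections (DeviceID TEXT, IPAddress TEXT, IPPort TEXT, "
    "BeginTimestamp TEXT, PRIMARY KEY (DeviceID, IPAddress, IPPort))"
)

# The suggested composite index; it does not need to be unique.
con.execute("CREATE INDEX idx_conn_device_ts ON Connections (DeviceID, BeginTimestamp)")

# PRAGMA index_list shows it alongside the automatic primary-key index.
print([row[1] for row in con.execute("PRAGMA index_list(Connections)")])
```

Indexes can be created or dropped at any time after the table exists, so this doesn't require changing the CREATE TABLE statement.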
Most UPDATE statements will invoke a full table scan of the target table regardless of index structures… particularly with multiple sub-queries like you show.
One way to determine if there is any way to speed it up is to run an EXPLAIN (in SQLite, EXPLAIN QUERY PLAN) and analyze the result.
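A sketch of running EXPLAIN QUERY PLAN against the schema from this thread, via Python's sqlite3 (the index name `idx_conn_device_ts` is illustrative, not something from the original post):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Devices (DeviceID TEXT, DeviceName TEXT, IPAddress TEXT,
                      LatestSysVersion TEXT, LatestAppVersion TEXT,
                      PRIMARY KEY (DeviceID));
CREATE TABLE Connections (DeviceID TEXT, IPAddress TEXT, IPPort TEXT,
                          BeginTimestamp TEXT,
                          PRIMARY KEY (DeviceID, IPAddress, IPPort));
CREATE INDEX idx_conn_device_ts ON Connections (DeviceID, BeginTimestamp);
""")

# Ask SQLite how it intends to execute the slow UPDATE.
plan = con.execute("""
EXPLAIN QUERY PLAN
UPDATE Devices SET IPAddress =
  (SELECT IPAddress FROM Connections
    WHERE Devices.DeviceID = Connections.DeviceID
      AND Connections.BeginTimestamp =
          (SELECT MAX(Connections.BeginTimestamp) FROM Connections
            WHERE Devices.DeviceID = Connections.DeviceID))
""").fetchall()

for row in plan:
    print(row[-1])  # the last column is the human-readable plan step
```

If the plan shows SEARCH on Connections using an index rather than SCAN, the index is being used; the exact wording varies by SQLite version.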
While the UPDATE does a full table scan on Devices, so an index won’t make a difference there, it also does a SELECT on Connections, so an index on the columns being selected should help. Especially when using MAX().
What platform and database are involved? Is the database local or over a network?
With tables that small, 10 minutes seems extraordinary… it should be more like 10 seconds at worst, even with no indexes.
But yes, what Tim suggested should help… and a slight mod to the query might help a bit:
Sql = _
"UPDATE Devices AS a" + _
"   SET IPAddress = (SELECT IPAddress " + _
"         FROM Connections b" + _
"        WHERE a.DeviceID = b.DeviceID" + _
"          AND b.BeginTimeStamp = (SELECT MAX(BeginTimeStamp) " + _
"                FROM Connections c " + _
"               WHERE b.DeviceID = c.DeviceID ))"
Other than restructuring (my preference) and using single-letter aliases (a, b and c), the key change is that the DeviceID correlation in the last subquery is between table “b” and table “c”, NOT between table “a” and table “c”.
This might result in some performance increase, since the engine can optimize the last subquery without referring back to the first table.
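A quick sanity check of that restructuring (via Python's sqlite3, with made-up sample rows; the sketch keeps the plain table name for the outer reference, since older SQLite versions are picky about aliasing the UPDATE target):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Devices (DeviceID TEXT PRIMARY KEY, IPAddress TEXT);
CREATE TABLE Connections (DeviceID TEXT, IPAddress TEXT, IPPort TEXT, BeginTimestamp TEXT);
INSERT INTO Devices (DeviceID) VALUES ('d1');
INSERT INTO Connections VALUES
  ('d1', '10.0.0.1', '80', '2015-01-01 00:00:00'),
  ('d1', '10.0.0.2', '80', '2015-06-01 00:00:00');
""")

# The inner MAX() correlates to alias b, not back to Devices, as suggested above.
con.execute("""
UPDATE Devices
   SET IPAddress = (SELECT b.IPAddress
                      FROM Connections b
                     WHERE Devices.DeviceID = b.DeviceID
                       AND b.BeginTimestamp = (SELECT MAX(c.BeginTimestamp)
                                                 FROM Connections c
                                                WHERE b.DeviceID = c.DeviceID))
""")

print(con.execute("SELECT IPAddress FROM Devices WHERE DeviceID = 'd1'").fetchone()[0])
# -> 10.0.0.2 (the IP of the most recent connection)
```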
Gotta love how this forum formats “code” blocks
Thanks for the interesting-looking suggestion. I’m still a DB noob and don’t yet have a good grasp on indexes. My mind wants to treat an index like a key (which is unique), but an index on (Connections.DeviceID, Connections.BeginTimeStamp) won’t be unique. I assume that’s OK?
To add your suggested Index do I just add an Index statement after I create the table? Or does it have to be part of the Create statement itself?
You can drop/add indexes at any time. And yes, it doesn’t have to be unique.
Without the index, the db has to read the entire Connections table to get the MAX value, then use that and read the entire table again to get the matching IPAddress, and repeat for each Devices record. In the worst case, if the MAX subquery is re-evaluated for every candidate row, that’s on the order of 1255 × 220,000 × 220,000 row visits! An index, even a non-unique one, will reduce that number considerably. Even an index on just Connections.DeviceID will help a lot.
And I second Dave’s recommendation of using EXPLAIN to check what your queries are going to do.
It’s Xojo’s built-in SQLite database, on a local spinning disk: OS X 10.10 on a MacBook Pro, 2.7 GHz i7, 8 GB RAM.
I thought I read that having DeviceID as one part of the Primary Key would result in having an Index on it???
Also note that I only need the Max Timestamp of the subset of Connection records that match the DeviceID of the current Devices record.
So I’m still a bit unclear… I thought I already had a DeviceID Index into the Connections table… and I don’t yet see how having an Index on Timestamp will help since I need to get the Timestamp value and not use it to access the record.
I’ll sleep on it and hopefully understand it better in the morning.
PK’s must be unique… no duplicate values in the table; attempting to insert a record with an existing PK will result in a database constraint error. Note that in SQLite, a declared PRIMARY KEY (other than INTEGER PRIMARY KEY, which aliases the rowid) is actually enforced by an automatic unique index, so your composite PK on Connections does already give you an index whose leftmost column is DeviceID.
Indexes do not have to be unique; sometimes they are, sometimes they are not… but they “group” the records with the same “key” values (not to be confused with “primary key” values) together, and allow the database to use a quick search method to limit the number of records required to satisfy a particular query.
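For what it’s worth, SQLite will show you exactly which indexes a table has, including the automatic one backing the primary key; a quick sketch via Python’s sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE Connections (DeviceID TEXT, IPAddress TEXT, IPPort TEXT, "
    "BeginTimestamp TEXT, PRIMARY KEY (DeviceID, IPAddress, IPPort))"
)

# SQLite backs a non-INTEGER PRIMARY KEY with an automatic unique index.
for row in con.execute("PRAGMA index_list(Connections)"):
    name = row[1]
    cols = [info[2] for info in con.execute(f"PRAGMA index_info({name})")]
    print(name, cols)  # -> sqlite_autoindex_Connections_1 ['DeviceID', 'IPAddress', 'IPPort']
```

Since DeviceID is the leftmost column of that automatic index, equality lookups on DeviceID alone can already use it.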
HOWEVER… that all being said… there will be times when YOU think a particular index will help, but the database engine will disagree… that is where the EXPLAIN PLAN comes in… it lets you know what the database engine thinks about the best way to answer a query.
And you are mostly correct… the major use of the index is the DeviceID; adding the Timestamp as a second column matters less here, though it can let the engine answer MAX() directly from the index…
And remember that the database engine will break your query into solvable sections, which is why I suggested what I did in the response above.
The problem is that you’re using multiple levels of subqueries which is a bad idea most of the time. You cannot always fix that by using indexes.
Use the following statement:
UPDATE Devices
   SET IPAddress = substr((SELECT MAX(datetime(C.BeginTimeStamp) || C.IPAddress)
                             FROM Connections C
                            WHERE C.DeviceID = Devices.DeviceID), 20)
In my test environment this is much faster.
Interesting approach… but I submit that the use of substr() and || concatenation overshadows a properly defined query.
The datetime() function has a fixed output format of 19 characters (“YYYY-MM-DD HH:MM:SS”), so taking the substring from position 20 works. Sometimes you have to be a little bit creative to avoid pitfalls like multilevel subqueries.
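A quick check of that fixed-width trick with made-up rows (Python’s sqlite3 for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Devices (DeviceID TEXT PRIMARY KEY, IPAddress TEXT);
CREATE TABLE Connections (DeviceID TEXT, IPAddress TEXT, IPPort TEXT, BeginTimestamp TEXT);
INSERT INTO Devices (DeviceID) VALUES ('d1');
INSERT INTO Connections VALUES
  ('d1', '10.0.0.1',    '80', '2015-01-01 08:00:00'),
  ('d1', '192.168.1.9', '80', '2015-06-01 09:30:00');
""")

# MAX over "timestamp || ip" orders by the fixed-width 19-character timestamp;
# substr(..., 20) then strips the timestamp prefix, leaving only the IP address.
con.execute("""
UPDATE Devices
   SET IPAddress = substr((SELECT MAX(datetime(C.BeginTimestamp) || C.IPAddress)
                             FROM Connections C
                            WHERE C.DeviceID = Devices.DeviceID), 20)
""")

print(con.execute("SELECT IPAddress FROM Devices").fetchone()[0])  # -> 192.168.1.9
```

Note that the concatenated MAX is computed in a single pass over each device’s connections, which is what avoids the nested-subquery re-evaluation.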
In my test with 5,000 devices and 5,000,000 connections (1,000 connections per device) and an index on DeviceID, the UPDATE is done within 9 seconds.