I’m rewriting the main sql queries to improve search (and the code was quite kludgy). The search for attachments is in a separate query:
SELECT mailid
FROM attachment
WHERE attachment.attachmentname REGEX '(?i)png'
GROUP BY MailID
Then I inserted the recids into the final query.
A double join is not fast enough with 1.7 seconds on 50k emails in the message table:
SELECT
Message.MailboxID,
Message.RecID,
Message.OriginatorFrom as 'From',
(etc)
FROM Message
INNER JOIN Mailbox ON Message.MailboxID = Mailbox.RecID
INNER JOIN Attachment ON Message.RecID = Attachment.MailID
WHERE Attachment.AttachmentName like '%png%'
Why is a subquery so super slow with 1.5 minutes(!!!) on the same 50k emails?
SELECT
Message.MailboxID,
Message.RecID,
Message.OriginatorFrom AS 'From',
(etc)
FROM Message
INNER JOIN Mailbox ON Message.MailboxID = Mailbox.RecID
WHERE
Message.RecID IN (
SELECT mailid
FROM attachment
WHERE attachment.attachmentname REGEX '(?i)png'
GROUP BY MailID)
My idea was to get away from the kludge of having 2 queries. Instead I wanted to build a single SQL query. I could use a view instead but this also needs access to the database.
You’re right about searching with REGEX or LIKE, indexing probably won’t help. In your example you’re searching for png attachments, could you add a boolean column indicating if the attachment is png? Or if you search for other types, a short text column for the attachment type? This could be indexed and should search fast. You would need to compute the value for this column when you store the record.
In Your second query, you’re running that regex command against every attachment in your database whereas in your first query you’re only looking for %png% in the attachments that belong to the current email. Regex tends to take longer than simple matches anyway and if there are lots of attachments, that’s going to add up.
Have you tried using EXPLAIN to find out where the costs are?
DO Scan of Join Table on condition(s),
FOR EXPR DO Scan of Join Table on condition(s), which could not be resolved before join : TIME = 365360ms : ROWS = 620
Is telling us that your second join is taking 365 seconds to run.
I’m a little surprised that you are not using a LEFT JOIN here as your query wont return messages that don’t have any attachments when you use an INNER JOIN.
You may benefit from an index on your attachment table of attachmentID and mailID.
I’d also argue that if the number of attachments on an email is less than say 50, you might be better off pulling all of the records and then checking the extensions in code.
Message (column RecID int only)
Attachment (columns RecID,MailID as int and AttachmentName as text)
with some rows in both tables connected by RecID/MailID and then some attachment names like test.jpg etc.
I am using MySQL and this one doesn’t have REGEX so I could only test it with ‘like’ with uppercase (to get around “I need it to be case insensitive”) but this surely works:
select Message.RecID from Message
where exists (select Attachment.MailID from Attachment where Attachment.MailID=Message.RecID and upper(Attachment.attachmentname) like '%.PNG')
try please with like, just for testing sake. I then tested this with Oracle and SQLite - works everywhere. So if it doesn’t work with Valentina… ;D
…on the other hand, the rule of thumb is to do main select from the table where doing searches is most costly - in this case it would be Attachment. This way you will search the “heaviest” table just once.
This should be fast:
SELECT distinct (Attachment.MailID) as Mail_ID, Message.MailboxID, Message.RecID, Message.OriginatorFrom
FROM attachment
JOIN Message on (Message.RecID = Attachment.MailID)
WHERE attachment.attachmentname REGEX '(?i)png'