Improve speed of SQL query

My database has 3 tables:

  • Mailbox
  • Message
  • Attachment

I’m rewriting the main sql queries to improve search (and the code was quite kludgy). The search for attachments is in a separate query:

SELECT mailid 
FROM attachment 
WHERE attachment.attachmentname REGEX '(?i)png' 
GROUP BY MailID

Then I inserted the recids into the final query.

A double join is not fast enough with 1.7 seconds on 50k emails in the message table:

SELECT 
    Message.MailboxID,
    Message.RecID,
    Message.OriginatorFrom as 'From',
 (etc)
FROM Message 
INNER JOIN Mailbox ON Message.MailboxID = Mailbox.RecID
INNER JOIN Attachment ON Message.RecID = Attachment.MailID
WHERE Attachment.AttachmentName like '%png%'

Why is a subquery so super slow with 1.5 minutes(!!!) on the same 50k emails?

SELECT 
    Message.MailboxID,
    Message.RecID,
    Message.OriginatorFrom AS 'From',
(etc)
FROM Message 
INNER JOIN Mailbox ON Message.MailboxID = Mailbox.RecID
WHERE 
    Message.RecID IN (
        SELECT mailid 
        FROM attachment 
        WHERE attachment.attachmentname REGEX '(?i)png' 
        GROUP BY MailID) 

My idea was to get away from the kludge of having 2 queries. Instead I wanted to build a single SQL query. I could use a view instead but this also needs access to the database.

Any other ideas?

Do you have indexes for some/all of the fields you are matching against?

So far I have been using only REGEX ‘(?i)png’ for searching. As far as I remember indexing doesn’t help with regex.

You’re right about searching with REGEX or LIKE, indexing probably won’t help. In your example you’re searching for png attachments, could you add a boolean column indicating if the attachment is png? Or if you search for other types, a short text column for the attachment type? This could be indexed and should search fast. You would need to compute the value for this column when you store the record.

In Your second query, you’re running that regex command against every attachment in your database whereas in your first query you’re only looking for %png% in the attachments that belong to the current email. Regex tends to take longer than simple matches anyway and if there are lots of attachments, that’s going to add up.

Have you tried using EXPLAIN to find out where the costs are?

1 Like

The png search was just an example because I was too lazy to loo up an attachment name.

The first query is against all attachments in the attachment table. I’m not sure what the explain tells me:

I believe that this line:

DO Scan of Join Table on condition(s),
FOR EXPR DO Scan of Join Table on condition(s), which could not be resolved before join : TIME = 365360ms : ROWS = 620

Is telling us that your second join is taking 365 seconds to run.

I’m a little surprised that you are not using a LEFT JOIN here as your query wont return messages that don’t have any attachments when you use an INNER JOIN.

You may benefit from an index on your attachment table of attachmentID and mailID.

I’d also argue that if the number of attachments on an email is less than say 50, you might be better off pulling all of the records and then checking the extensions in code.

Instead of doing the subquery first the database does the join dynamically, I guess.

The query with the double join was just a speed test.

The index doesn’t do anything because of the regex.

As I don’t know the number of affected attachments before I do the query this doesn’t help.

For now I’ll try with a view.

try changing from:

to

WHERE exists (select attachment.mailid from
attachment where attachment.mailid=Message.RecID 
and attachment.attachmentname REGEX '(?i)png')

creating an index of mailid on attachment would greatly help in the process…

Great idea. I’m on a tiny database for testing which gives a result of 17 emails for the attachment search. The new query gives an empty result.

btw if just matching “is containing text” - using LIKE can be like 10 times faster than REGEX.

so try this:

WHERE exists (select attachment.mailid from
attachment where attachment.mailid=Message.RecID 
and attachment.attachmentname like '%.png%')

the file name ends with .png, so like '%.png%' should be like '%.png'

1 Like

you’re right :slight_smile:

The index I was suggesting has nothing to do with the regex call. I’m suggesting that you create an index of the attachmentID and the mailID.

The IDs have indexes.

I still need the regex for case insensitive searching. In Valentina there isn’t a big difference between like and regex.

@Pawel_Soltysinski : I really would like to use the “where exists” query. But it doesn’t give me any result.

I have created two tables:

Message (column RecID int only)
Attachment (columns RecID,MailID as int and AttachmentName as text)

with some rows in both tables connected by RecID/MailID and then some attachment names like test.jpg etc.

I am using MySQL and this one doesn’t have REGEX so I could only test it with ‘like’ with uppercase (to get around “I need it to be case insensitive”) but this surely works:

select Message.RecID from Message
where exists (select Attachment.MailID from Attachment where Attachment.MailID=Message.RecID and upper(Attachment.attachmentname) like '%.PNG')

try please with like, just for testing sake. I then tested this with Oracle and SQLite - works everywhere. So if it doesn’t work with Valentina… ;D

…on the other hand, the rule of thumb is to do main select from the table where doing searches is most costly - in this case it would be Attachment. This way you will search the “heaviest” table just once.

This should be fast:

SELECT distinct (Attachment.MailID) as Mail_ID, Message.MailboxID, Message.RecID, Message.OriginatorFrom
FROM attachment 
JOIN Message on (Message.RecID = Attachment.MailID)
WHERE attachment.attachmentname REGEX '(?i)png'

1 Like

I’ll do some testing probably earliest on Wednesday because everything needs to be packed up for the move tomorrow.