Skip to content

Stunningly Slow MySQL Mystery

Basically we’re talking end of the universe slow. How did this happen? It was a simple mistake, an unsuspecting query gone wrong. Who’s to blame? Me of course. But it is good to understand the subtle differences between a good query and a bad query.

It all started with a query to get some unique, random ID codes from my database. By “random id codes” I just mean a single-column result of 1000 randomized IDs that were NOT IN a matching ID column in another table. Something I wrote quickly and without sufficient thought was:

SELECT Id,Data
    FROM SourceData
   WHERE Id NOT IN (SELECT SourceID FROM Page GROUP BY SourceID)
     AND WebAddress != ''
ORDER BY RAND()
   LIMIT 1000

My thought was simply, “GROUP BY will consolidate all the results into a unique list, works.” Later while monitoring my script with top, I noticed MySQL process was consuming ~100% CPU and running for 3:30 minutes :-0 uh oh. Well it ended shortly thereafter but it made me investigate further. I used the fabled EXPLAIN command to delve further:

mysql> explain   SELECT Id,Data
    ->     FROM SourceData
    ->    WHERE Id NOT IN (SELECT SourceID FROM Page GROUP BY SourceID)
    ->      AND WebAddress != ''
    -> ORDER BY RAND()
    ->    LIMIT 1000;
+----+--------------------+------------+-------+---------------+----------+---------+------+------+----------------------------------------------+
| id | select_type        | table      | type  | possible_keys | key      | key_len | ref  | rows | Extra                                        |
+----+--------------------+------------+-------+---------------+----------+---------+------+------+----------------------------------------------+
|  1 | PRIMARY            | SourceData | ALL   | NULL          | NULL     | NULL    | NULL | 8379 | Using where; Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | Page       | index | NULL          | SourceID | 4       | NULL | 9369 | Using index                                  |
+----+--------------------+------------+-------+---------------+----------+---------+------+------+----------------------------------------------+
2 rows in set (0.05 sec)

So, of course, I have no idea what any of that means, but I assume it means that the query will run slow. The query is easy enough to rewrite:

SELECT Id,WebAddress
  FROM SourceData
 WHERE Id NOT IN (SELECT DISTINCT(SourceID) FROM OtherData) /* GROUP BY (SourceID) BAD! */
   AND WebAddress != ''
 ORDER BY RAND()
 LIMIT 1000;

Let’s see how that tests out:

1000 rows in set (0.11 sec)

Not bad, versus the results of the first one.. I think it’s finished by now:

1000 rows in set (8 min 9.95 sec)

I expect there’s a logical explanation why this performs this way. But still, (8 min 9.95 sec / 0.11 sec ) = 4,454 times longer, is stunning.

Published inDevelopment

Be First to Comment

Leave a Reply