Basically we’re talking end of the universe slow. How did this happen? It was a simple mistake, an unsuspecting query gone wrong. Who’s to blame? Me of course. But it is good to understand the subtle differences between a good query and a bad query.
It all started with a query to get some unique, random ID codes from my database. By “random id codes” I just mean a single-column result of 1000 randomized IDs that were NOT IN a matching ID column in another table. Something I wrote quickly and without sufficient thought was:
SELECT Id,Data
FROM SourceData
WHERE Id NOT IN (SELECT SourceID FROM Page GROUP BY SourceID)
AND WebAddress != ''
ORDER BY RAND()
LIMIT 1000
My thought was simply, “GROUP BY will consolidate all the results into a unique list, works.” Later while monitoring my script with top, I noticed MySQL process was consuming ~100% CPU and running for 3:30 minutes :-0 uh oh. Well it ended shortly thereafter but it made me investigate further. I used the fabled EXPLAIN command to delve further:
mysql> explain SELECT Id,Data
-> FROM SourceData
-> WHERE Id NOT IN (SELECT SourceID FROM Page GROUP BY SourceID)
-> AND WebAddress != ''
-> ORDER BY RAND()
-> LIMIT 1000;
+----+--------------------+------------+-------+---------------+----------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+-------+---------------+----------+---------+------+------+----------------------------------------------+
| 1 | PRIMARY | SourceData | ALL | NULL | NULL | NULL | NULL | 8379 | Using where; Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | Page | index | NULL | SourceID | 4 | NULL | 9369 | Using index |
+----+--------------------+------------+-------+---------------+----------+---------+------+------+----------------------------------------------+
2 rows in set (0.05 sec)
So, of course, I have no idea what any of that means, but I assume it means that the query will run slow. The query is easy enough to rewrite:
SELECT Id,WebAddress
FROM SourceData
WHERE Id NOT IN (SELECT DISTINCT(SourceID) FROM OtherData) /* GROUP BY (SourceID) BAD! */
AND WebAddress != ''
ORDER BY RAND()
LIMIT 1000;
Let’s see how that tests out:
1000 rows in set (0.11 sec)
Not bad, versus the results of the first one.. I think it’s finished by now:
1000 rows in set (8 min 9.95 sec)
I expect there’s a logical explanation why this performs this way. But still, (8 min 9.95 sec / 0.11 sec ) = 4,454 times longer, is stunning.
Be First to Comment