There are some case we need to select random records in mysql. The normal way I did see many developers have used is ORDER BY RAND(). But they might not recognize that if the number of records increase, the query is extremely slow due to the fact that it must use filesort for the random result.
We will consider some different query approaches on a table with ~210k records, on a 8 cores machine with 32G of RAM (12G is dedicated for mysql usage, with optimized parameters are set by mysqltunner)
First, for ORDER BY RAND()
SELECT `companies`.* FROM `companies` WHERE (id_state IS NOT NULL) AND (id_state != '') AND (id_city IS NOT NULL) AND (id_city != '') ORDER BY RAND() ASC LIMIT 4;
The average query time is 7s. Let’s see how it works:
+----+-------------+-----------+-------+---------------------------------+----------------+---------+------+--------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-----------+-------+---------------------------------+----------------+---------+------+--------+----------------------------------------------+ | 1 | SIMPLE | companies | range | Unique_Company,id_state,id_city | Unique_Company | 3 | NULL | 105796 | Using where; Using temporary; Using filesort | +----+-------------+-----------+-------+---------------------------------+----------------+---------+------+--------+----------------------------------------------+ 1 row in set (0.00 sec)
How to optimize it?
Many different ways, but the idea is to avoid using of ORDER BY RAND.
Method 1: Using RAND() in the where clause
SELECT `companies`.* FROM `companies` WHERE (id_state IS NOT NULL) AND (id_state != '') AND (id_city IS NOT NULL) AND (id_city != '') AND (RAND() < 0.1) LIMIT 4
The above mothod needs 0.0018s on the same dataset. Using EXPLAIN, we can easily see that the query does not use temporary table or filesort anymore. But there is still a trick on the 0.1 value on the experession: RAND() < 0.1
Method 2: Using a random starting ID with JOIN
SELECT `companies`.* FROM `companies` JOIN (SELECT (RAND() * (SELECT MAX(id) FROM companies)) AS id) AS rd WHERE companies.id >= rd.id AND (id_state IS NOT NULL) AND (id_state != '') AND (id_city IS NOT NULL) AND (id_city != '') LIMIT 4;
–> Needs 0.0005s to execute.
Another enhancement (or more stupid enhancement :P) is adding the where condition on the random starting id selection query as follows:
SELECT `companies`.* FROM `companies` JOIN (SELECT (RAND() * (SELECT MAX(id) FROM companies WHERE (id_state IS NOT NULL) AND (id_state != '') AND (id_city IS NOT NULL) AND (id_city != '') )) AS id) AS rd WHERE companies.id >= rd.id AND (id_state IS NOT NULL) AND (id_state != '') AND (id_city IS NOT NULL) AND (id_city != '') LIMIT 4;
–> This needs 0.1s to query, since it needs more time on the random selection query.
Conclusion
The need of displaying random records might be not popular, so when you need to use it, be sure not to make your server crash by a not-effective query way. So, to achieve an acceptable random level, you can choose which is appropriate for your project, just remember not to user ORDER BY RAND() on a quite big dataset.