Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimize a JOIN ORDER BY RAND() mysql query in a large database

Tags:

I am working on a project which has a large Question Bank, and for Tests added to the System, 20 questions are fetched on Run-Time dynamically based on the following query:

SELECT Question.* from Question JOIN Test 
ON Question.Subject_ID = Test.Subject_ID 
AND Question.Question_Level = Test.Test_Level 
ORDER BY RAND() 
LIMIT 20;

However, as it is known that the RAND() function the MySQL kills your server I have been looking for better solutions.

Result of EXPLAIN [above query]:

+----+-------------+----------+------+---------------+------+---------+------+------+----------------------------------------------------+
| id | select_type | table    | type | possible_keys | key  | key_len | ref  | rows | Extra                                              |
+----+-------------+----------+------+---------------+------+---------+------+------+----------------------------------------------------+
|  1 | SIMPLE      | Test     | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using temporary; Using filesort                    |
|  1 | SIMPLE      | Question | ALL  | NULL          | NULL | NULL    | NULL |    7 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+----------+------+---------------+------+---------+------+------+----------------------------------------------------+

Result of EXPLAIN Question:

+-------------------+------------------------------------------+------+-----+---------+----------------+
| Field             | Type                                     | Null | Key | Default | Extra          |
+-------------------+------------------------------------------+------+-----+---------+----------------+
| Question_ID       | int(11)                                  | NO   | PRI | NULL    | auto_increment |
| Questions         | varchar(100)                             | NO   |     | NULL    |                |
| Available_Options | varchar(200)                             | NO   |     | NULL    |                |
| Correct_Answer    | varchar(50)                              | NO   |     | NULL    |                |
| Subject_ID        | int(11)                                  | NO   |     | NULL    |                |
| Question_Level    | enum('Beginner','Intermediate','Expert') | NO   |     | NULL    |                |
| Created_By        | int(11)                                  | NO   |     | NULL    |                |
+-------------------+------------------------------------------+------+-----+---------+----------------+

Result of EXPLAIN Test:

+----------------+------------------------------------------+------+-----+---------+----------------+
| Field          | Type                                     | Null | Key | Default | Extra          |
+----------------+------------------------------------------+------+-----+---------+----------------+
| Test_ID        | int(11)                                  | NO   | PRI | NULL    | auto_increment |
| Test_Name      | varchar(50)                              | NO   |     | NULL    |                |
| Test_Level     | enum('Beginner','Intermediate','Expert') | NO   |     | NULL    |                |
| Subject_ID     | int(11)                                  | NO   |     | NULL    |                |
| Question_Count | int(11)                                  | NO   |     | NULL    |                |
| Created_By     | int(11)                                  | NO   |     | NULL    |                |
+----------------+------------------------------------------+------+-----+---------+----------------+

Any help would be appreciated to optimize the query to reduce server load and execution time.

P.S. The system has the capability of Deletion too so the AUTO_INCREMENT PRIMARY KEY of the QUESTION and TEST table can have large gaps.

like image 994
Ayush Gupta Avatar asked Aug 20 '16 14:08

Ayush Gupta


2 Answers

I like this question. It's a very good optimization puzzle, and let's assume for the moment that performance is very important for this query, and that you cannot use any dynamically inserted values (e.g. from PHP).

One high performance solution would be to add column with random values (say called "Rand"), order the table by this value, and periodically regenerate and re-order the table. You could then use a query like this one:

SELECT Question.* from Question 
JOIN Test 
ON Question.Subject_ID = Test.Subject_ID 
AND Question.Question_Level = Test.Test_Level  
WHERE Question.Rand > RAND() 
LIMIT 20

This would perform at O(n), requiring only one scan of the table, but it would come with the risk of returning fewer than 20 results if a value very close to 1 was generated. If this was an acceptable risk (e.g. you could programmatically check for an inadequate result and re-query), you would end up with nice runtime performance.

The periodic re-generating and re-ordering of the numbers is necessary because rows early in the table with high Rand values would be favored and show up disproportionately frequently in the results. (Imagine if the first row was lucky enough to receive a Rand value of .95)

Even better would be to create a column with contiguous integers, index on this column, and then randomly choose an insertion point to grab 20 results. Such a query might look like this:

SELECT Question.* from Question 
JOIN Test 
ON Question.Subject_ID = Test.Subject_ID 
AND Question.Question_Level = Test.Test_Level  
CROSS JOIN (SELECT MAX(Rand_id) AS max_id FROM Question)
WHERE Question.Rand_Id > ROUND(RAND() * max_id)
LIMIT 20

But what if you can't alter your table in any way? If it doesn't matter how messy your SQL is, and there is a relatively low proportion of missing ids (say roughly 1/10th). You could achieve your 20 random questions with a good degree of probability with the following SQL:

SELECT Question.* from Question JOIN Test 
  ON Question.Subject_ID = Test.Subject_ID 
  AND Question.Question_Level = Test.Test_Level 
  WHERE Question.Question_ID IN (
    SELECT DISTINCT(ROUND(rand * max_id)) AS rand_id 
    FROM ( --generate 30 random numbers to make sure we get 20 results
      SELECT RAND() AS rand UNION ALL
      SELECT RAND() AS rand UNION ALL
      SELECT RAND() AS rand UNION ALL
      SELECT RAND() AS rand UNION ALL
      ...
      SELECT RAND() AS rand UNION ALL
      SELECT RAND() AS rand UNION ALL
      SELECT RAND() AS rand
    ) a 
    CROSS JOIN ( --get the max possible id from the Question table
      SELECT MAX(id) AS max_id FROM Question
    ) b
  )
LIMIT 20 --finally pare our results down to 20 in case we got too many

However, this will cause problems in your use case, because you effectively can't know how many results (and their IDs) will be in the result set after the join. After joining on subject and difficulty, the proportion of missing IDs might be very high and you might end up with far fewer than 20 results, even with several hundred random guesses of what IDs might be in a table.

If you're able to use logic from PHP (sounds like you are), a lot of high performance solutions open up. You could, for example, create in PHP an object whose job it was to store arrays of all the IDs of Questions with a particular subject and difficulty level. You could then pick 20 random array indexes and get back 20 valid IDs, allowing you to run a very simple query.

SELECT Question.* from Question WHERE Question_ID IN ($dynamically_inserted_ids)

Anyway, I hope this gets your imagination going with some possibilities.

like image 100
Nate Vaughan Avatar answered Nov 03 '22 00:11

Nate Vaughan


Why don't you rand the numbers in PHP and then select the questions by id? Here's the logic of my point:

$MIN       = 1;
$MAX       = 50000; // You may want to get the MAX from your database
$questions = '';

for($i = 0; $i < 20; $i++)
   $questions .= mt_rand($MIN, $MAX) . ',';

// Removes last comma
$questions = rtrim($questions, ',');

$query = "SELECT * FROM Question WHERE Question.id IN ($questions)";

Edit 1:

I was thinking about the problem, and it ocurred me that you can select all the ID's from your db and then pick 20 items using the array_rand() function.

$values    = array(1, 5, 10000, 102021, 1000000); // Your database ID's
$questions = array_rand($values, 20);

$questions[0];
$questions[1];
$questions[2]; // etc
like image 26
Linesofcode Avatar answered Nov 03 '22 02:11

Linesofcode