I am working on a project which has a large Question
Bank, and for Tests added
to the System, 20 questions are fetched on Run-Time dynamically based on the following query:
SELECT Question.* from Question JOIN Test
ON Question.Subject_ID = Test.Subject_ID
AND Question.Question_Level = Test.Test_Level
ORDER BY RAND()
LIMIT 20;
However, as it is known that the RAND()
function the MySQL kills your server I have been looking for better solutions.
Result of EXPLAIN [above query]
:
+----+-------------+----------+------+---------------+------+---------+------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+------+----------------------------------------------------+
| 1 | SIMPLE | Test | ALL | NULL | NULL | NULL | NULL | 5 | Using temporary; Using filesort |
| 1 | SIMPLE | Question | ALL | NULL | NULL | NULL | NULL | 7 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+----------+------+---------------+------+---------+------+------+----------------------------------------------------+
Result of EXPLAIN Question
:
+-------------------+------------------------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+------------------------------------------+------+-----+---------+----------------+
| Question_ID | int(11) | NO | PRI | NULL | auto_increment |
| Questions | varchar(100) | NO | | NULL | |
| Available_Options | varchar(200) | NO | | NULL | |
| Correct_Answer | varchar(50) | NO | | NULL | |
| Subject_ID | int(11) | NO | | NULL | |
| Question_Level | enum('Beginner','Intermediate','Expert') | NO | | NULL | |
| Created_By | int(11) | NO | | NULL | |
+-------------------+------------------------------------------+------+-----+---------+----------------+
Result of EXPLAIN Test
:
+----------------+------------------------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+------------------------------------------+------+-----+---------+----------------+
| Test_ID | int(11) | NO | PRI | NULL | auto_increment |
| Test_Name | varchar(50) | NO | | NULL | |
| Test_Level | enum('Beginner','Intermediate','Expert') | NO | | NULL | |
| Subject_ID | int(11) | NO | | NULL | |
| Question_Count | int(11) | NO | | NULL | |
| Created_By | int(11) | NO | | NULL | |
+----------------+------------------------------------------+------+-----+---------+----------------+
Any help would be appreciated to optimize the query to reduce server load and execution time.
P.S. The system has the capability of Deletion too so the AUTO_INCREMENT PRIMARY KEY of the QUESTION and TEST table can have large gaps.
I like this question. It's a very good optimization puzzle, and let's assume for the moment that performance is very important for this query, and that you cannot use any dynamically inserted values (e.g. from PHP).
One high performance solution would be to add column with random values (say called "Rand"), order the table by this value, and periodically regenerate and re-order the table. You could then use a query like this one:
SELECT Question.* from Question
JOIN Test
ON Question.Subject_ID = Test.Subject_ID
AND Question.Question_Level = Test.Test_Level
WHERE Question.Rand > RAND()
LIMIT 20
This would perform at O(n), requiring only one scan of the table, but it would come with the risk of returning fewer than 20 results if a value very close to 1 was generated. If this was an acceptable risk (e.g. you could programmatically check for an inadequate result and re-query), you would end up with nice runtime performance.
The periodic re-generating and re-ordering of the numbers is necessary because rows early in the table with high Rand values would be favored and show up disproportionately frequently in the results. (Imagine if the first row was lucky enough to receive a Rand value of .95)
Even better would be to create a column with contiguous integers, index on this column, and then randomly choose an insertion point to grab 20 results. Such a query might look like this:
SELECT Question.* from Question
JOIN Test
ON Question.Subject_ID = Test.Subject_ID
AND Question.Question_Level = Test.Test_Level
CROSS JOIN (SELECT MAX(Rand_id) AS max_id FROM Question)
WHERE Question.Rand_Id > ROUND(RAND() * max_id)
LIMIT 20
But what if you can't alter your table in any way? If it doesn't matter how messy your SQL is, and there is a relatively low proportion of missing ids (say roughly 1/10th). You could achieve your 20 random questions with a good degree of probability with the following SQL:
SELECT Question.* from Question JOIN Test
ON Question.Subject_ID = Test.Subject_ID
AND Question.Question_Level = Test.Test_Level
WHERE Question.Question_ID IN (
SELECT DISTINCT(ROUND(rand * max_id)) AS rand_id
FROM ( --generate 30 random numbers to make sure we get 20 results
SELECT RAND() AS rand UNION ALL
SELECT RAND() AS rand UNION ALL
SELECT RAND() AS rand UNION ALL
SELECT RAND() AS rand UNION ALL
...
SELECT RAND() AS rand UNION ALL
SELECT RAND() AS rand UNION ALL
SELECT RAND() AS rand
) a
CROSS JOIN ( --get the max possible id from the Question table
SELECT MAX(id) AS max_id FROM Question
) b
)
LIMIT 20 --finally pare our results down to 20 in case we got too many
However, this will cause problems in your use case, because you effectively can't know how many results (and their IDs) will be in the result set after the join. After joining on subject and difficulty, the proportion of missing IDs might be very high and you might end up with far fewer than 20 results, even with several hundred random guesses of what IDs might be in a table.
If you're able to use logic from PHP (sounds like you are), a lot of high performance solutions open up. You could, for example, create in PHP an object whose job it was to store arrays of all the IDs of Questions with a particular subject and difficulty level. You could then pick 20 random array indexes and get back 20 valid IDs, allowing you to run a very simple query.
SELECT Question.* from Question WHERE Question_ID IN ($dynamically_inserted_ids)
Anyway, I hope this gets your imagination going with some possibilities.
Why don't you rand the numbers in PHP and then select the questions by id? Here's the logic of my point:
$MIN = 1;
$MAX = 50000; // You may want to get the MAX from your database
$questions = '';
for($i = 0; $i < 20; $i++)
$questions .= mt_rand($MIN, $MAX) . ',';
// Removes last comma
$questions = rtrim($questions, ',');
$query = "SELECT * FROM Question WHERE Question.id IN ($questions)";
Edit 1:
I was thinking about the problem, and it ocurred me that you can select all the ID's from your db and then pick 20 items using the array_rand() function.
$values = array(1, 5, 10000, 102021, 1000000); // Your database ID's
$questions = array_rand($values, 20);
$questions[0];
$questions[1];
$questions[2]; // etc
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With