I'm currently writing a webapp that matches users based on answered question. I've realized my matching algorithm in just one query and tuned it so far that it takes 8.2ms to calculate the match percentage between 2 users. But my webapp has to take a list of users and iterate through the list performing this query. For 5000 users it took 50sec on my local machine. Is it possible to put everything in one query that returns one column with the user_id and one column with the calculated match? Or is a stored procedure an option?
I'm currently working with MySQL but willing to switch databases if needed.
For anyone interested in the schema and data, I've created a SQLFiddle: http://sqlfiddle.com/#!2/84233/1
and my matching query:
SELECT COALESCE(SQRT( (100.0*as1.actual_score/ps1.possible_score) * (100.0*as2.actual_score/ps2.possible_score) ) - (100/ps1.commonquestions), 0) AS perc
FROM (SELECT SUM(imp.value) AS actual_score
FROM user_questions AS uq1
INNER JOIN importances imp ON imp.id = uq1.importance
INNER JOIN user_questions uq2 ON uq2.question_id = uq1.question_id AND uq2.user_id = 101
AND (uq1.accans1 = uq2.answer_id
OR uq1.accans2 = uq2.answer_id
OR uq1.accans3 = uq2.answer_id
OR uq1.accans4 = uq2.answer_id)
WHERE uq1.user_id = 1) AS as1,
(SELECT SUM(value) AS possible_score, COUNT(*) AS commonquestions
FROM user_questions AS uq1
INNER JOIN importances ON importances.id = uq1.importance
INNER JOIN user_questions uq2 ON uq1.question_id = uq2.question_id AND uq2.user_id = 101
WHERE uq1.user_id = 1) AS ps1,
(SELECT SUM(imp.value) AS actual_score
FROM user_questions AS uq1
INNER JOIN importances imp ON imp.id = uq1.importance
INNER JOIN user_questions uq2 ON uq2.question_id = uq1.question_id AND uq2.user_id = 1
AND (uq1.accans1 = uq2.answer_id
OR uq1.accans2 = uq2.answer_id
OR uq1.accans3 = uq2.answer_id
OR uq1.accans4 = uq2.answer_id)
WHERE uq1.user_id = 101) AS as2,
(SELECT SUM(value) AS possible_score
FROM user_questions AS uq1
INNER JOIN importances ON importances.id = uq1.importance
INNER JOIN user_questions uq2 ON uq1.question_id = uq2.question_id AND uq2.user_id = 1
WHERE uq1.user_id = 101) AS ps2
I was bored, so: Here's a rewritten version of your query - based on a PostgreSQL port of your schema - that calculates the matches for all user pairings at once:
http://sqlfiddle.com/#!12/30524/6
I've checked and it produces the same results for the user pair (1,5).
WITH
userids(uid) AS (
select distinct user_id from user_questions
),
users(u1,u2) AS (
SELECT u1.uid, u2.uid FROM userids u1 CROSS JOIN userids u2 WHERE u1 <> u2
),
scores AS (
SELECT
sum(CASE WHEN uq2.answer_id IN (uq1.accans1, uq1.accans2, uq1.accans3, uq1.accans4) THEN imp.value ELSE 0 END) AS actual_score,
sum(imp.value) AS potential_score,
count(1) AS common_questions,
users.u1,
users.u2
FROM user_questions AS uq1
INNER JOIN importances imp ON imp.id = uq1.importance
INNER JOIN user_questions uq2 ON uq2.question_id = uq1.question_id
INNER JOIN users ON (uq1.user_id=users.u1 AND uq2.user_id=users.u2)
GROUP BY u1, u2
),
score_pairs(u1,u2,u1_actual,u2_actual,u1_potential,u2_potential,common_questions) AS (
SELECT s1.u1, s1.u2, s1.actual_score, s2.actual_score, s1.potential_score, s2.potential_score, s1.common_questions
FROM scores s1 INNER JOIN scores s2 ON (s1.u1 = s2.u2 AND s1.u2 = s2.u1)
WHERE s1.u1 < s1.u2
)
SELECT
u1, u2,
COALESCE(SQRT( (100.0*u1_actual/u1_potential) * (100.0*u2_actual/u2_potential) ) - (100/common_questions), 0) AS "match"
FROM score_pairs;
There's no reason you couldn't port this back to MySQL, as the CTE is only there for readability and doesn't do anything you can't do with FROM (SELECT ...)
. There's no WITH RECURSIVE
clause and no CTE is referenced from more than one other CTE. You'd have a bit of a scary nested query, but that's just a formatting challenge.
Changes:
I haven't optimised the query; as written it runs in 5ms on my system. On bigger data it's possible you may need to restructure some of it or use tricks like converting some CTE clauses into SELECT ... INTO TEMPORARY TABLE
temp table creation statements that you then index before querying.
It's also possible that you'll want to move the generation of the users
rowset out of the CTE and into a FROM
subquery clause of scores
. That's because WITH
is required to behave as an optimisation fence between clauses, so the database must materialize rows and can't use tricks like pushing clauses up or down.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With