Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL: return user table with calculated column for match percentage?

I'm currently writing a webapp that matches users based on answered question. I've realized my matching algorithm in just one query and tuned it so far that it takes 8.2ms to calculate the match percentage between 2 users. But my webapp has to take a list of users and iterate through the list performing this query. For 5000 users it took 50sec on my local machine. Is it possible to put everything in one query that returns one column with the user_id and one column with the calculated match? Or is a stored procedure an option?

I'm currently working with MySQL but willing to switch databases if needed.

For anyone interested in the schema and data, I've created a SQLFiddle: http://sqlfiddle.com/#!2/84233/1

and my matching query:

SELECT COALESCE(SQRT( (100.0*as1.actual_score/ps1.possible_score) * (100.0*as2.actual_score/ps2.possible_score) ) - (100/ps1.commonquestions), 0) AS perc
  FROM (SELECT SUM(imp.value) AS actual_score 
      FROM user_questions AS uq1
      INNER JOIN importances imp ON imp.id = uq1.importance
      INNER JOIN user_questions uq2 ON uq2.question_id = uq1.question_id AND uq2.user_id = 101
        AND (uq1.accans1 = uq2.answer_id 
          OR uq1.accans2 = uq2.answer_id
          OR uq1.accans3 = uq2.answer_id
          OR uq1.accans4 = uq2.answer_id)
      WHERE uq1.user_id = 1) AS as1, 
  (SELECT SUM(value) AS possible_score, COUNT(*) AS commonquestions
      FROM user_questions AS uq1
      INNER JOIN importances ON importances.id = uq1.importance
      INNER JOIN user_questions uq2 ON uq1.question_id = uq2.question_id AND uq2.user_id = 101
      WHERE uq1.user_id = 1) AS ps1,
  (SELECT SUM(imp.value) AS actual_score 
      FROM user_questions AS uq1
      INNER JOIN importances imp ON imp.id = uq1.importance
      INNER JOIN user_questions uq2 ON uq2.question_id = uq1.question_id AND uq2.user_id = 1
        AND (uq1.accans1 = uq2.answer_id 
          OR uq1.accans2 = uq2.answer_id
          OR uq1.accans3 = uq2.answer_id
          OR uq1.accans4 = uq2.answer_id)
      WHERE uq1.user_id = 101) AS as2, 
  (SELECT SUM(value) AS possible_score 
      FROM user_questions AS uq1
      INNER JOIN importances ON importances.id = uq1.importance
      INNER JOIN user_questions uq2 ON uq1.question_id = uq2.question_id AND uq2.user_id = 1
      WHERE uq1.user_id = 101) AS ps2
like image 358
Mexxer Avatar asked Nov 12 '22 19:11

Mexxer


1 Answers

I was bored, so: Here's a rewritten version of your query - based on a PostgreSQL port of your schema - that calculates the matches for all user pairings at once:

http://sqlfiddle.com/#!12/30524/6

I've checked and it produces the same results for the user pair (1,5).

WITH
userids(uid) AS (
    select distinct user_id from user_questions
),
users(u1,u2) AS (
    SELECT u1.uid, u2.uid FROM userids u1 CROSS JOIN userids u2 WHERE u1 <> u2
),
scores AS (
        SELECT
            sum(CASE WHEN uq2.answer_id IN (uq1.accans1, uq1.accans2, uq1.accans3, uq1.accans4) THEN imp.value ELSE 0 END) AS actual_score,
            sum(imp.value) AS potential_score,
            count(1) AS common_questions,
            users.u1,
            users.u2
        FROM user_questions AS uq1
        INNER JOIN importances imp ON imp.id = uq1.importance
        INNER JOIN user_questions uq2 ON uq2.question_id = uq1.question_id
        INNER JOIN users ON (uq1.user_id=users.u1 AND uq2.user_id=users.u2)
        GROUP BY u1, u2
),
score_pairs(u1,u2,u1_actual,u2_actual,u1_potential,u2_potential,common_questions) AS (
    SELECT s1.u1, s1.u2, s1.actual_score, s2.actual_score, s1.potential_score, s2.potential_score, s1.common_questions
    FROM scores s1 INNER JOIN scores s2 ON (s1.u1 = s2.u2 AND s1.u2 = s2.u1)
    WHERE s1.u1 < s1.u2
)
SELECT
    u1, u2, 
    COALESCE(SQRT( (100.0*u1_actual/u1_potential) * (100.0*u2_actual/u2_potential) ) - (100/common_questions), 0) AS "match"
FROM  score_pairs;

There's no reason you couldn't port this back to MySQL, as the CTE is only there for readability and doesn't do anything you can't do with FROM (SELECT ...). There's no WITH RECURSIVE clause and no CTE is referenced from more than one other CTE. You'd have a bit of a scary nested query, but that's just a formatting challenge.

Changes:

  • Generate a set of distinct users
  • Self-join that set of distinct users to create a set of user pairings
  • and then join on that list of pairings in the score query to produce a table of scores
  • Produce the scores table by combining the largely duplicate queries for possiblescore1 and possiblescore2, actualscore1 and actualscore2.
  • then summarize it in the final outer query

I haven't optimised the query; as written it runs in 5ms on my system. On bigger data it's possible you may need to restructure some of it or use tricks like converting some CTE clauses into SELECT ... INTO TEMPORARY TABLE temp table creation statements that you then index before querying.

It's also possible that you'll want to move the generation of the users rowset out of the CTE and into a FROM subquery clause of scores. That's because WITH is required to behave as an optimisation fence between clauses, so the database must materialize rows and can't use tricks like pushing clauses up or down.

like image 93
Craig Ringer Avatar answered Nov 15 '22 13:11

Craig Ringer