Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filtering within Postrgres aggregations

I have a table in Postgres called tasks. It records Mechanical Turk-style tasks. It has the following columns:

entity_name, text (the thing being reviewed)
reviewer_email, text (the email address of the person doing the reviewing)
result, boolean (the entry provided by the reviewer)

Each entity that needs to be reviewed leads to the generation of two task rows, each assigned to a different reviewer. When both reviewers disagree (e.g. their values for result are not equal), the application kicks off a third task, assigned to a moderator. The moderators always have the same email domain.

I'm trying to get the counts for each time reviewer a reviewer has been overruled by a moderator, or affirmed by a moderator. I think I'm fairly close, but the last bit is proving tricky:

SELECT
  reviewer_email,
  COUNT(*) FILTER(
    WHERE entity_name IN (
      SELECT entity_name
      FROM tasks
      GROUP BY entity_name
      HAVING
        COUNT(*) FILTER (WHERE result IS NOT NULL) = 3 -- find the entities that have exactly three reviews
      AND
        -- this is the tricky part: 
        -- need something like:
        -- WHERE current_review.result = moderator_review.result
    )
  ) AS overruled_count
FROM
  tasks
WHERE
  result IS NOT NULL
GROUP BY
  reviewer_email
HAVING
  reviewer_email NOT LIKE '%@moderators-domain.net'

Sample data:

id | entity_name | reviewer_email             | result
 1 | apple       | [email protected]              | true
 2 | apple       | [email protected]            | false
 3 | apple       | mod@@moderators-domain.net | true
 4 | pair        | [email protected]              | true
 5 | pair        | [email protected]            | false
 6 | pair        | mod@@moderators-domain.net | false
 7 | kiwi        | [email protected]              | true
 8 | kiwi        | [email protected]            | true

Desired results:

reviewer_email  | overruled_count | affirmed_count
[email protected]   |               1 |              1
[email protected] |               1 |              1

Bob and Alice each have done three reviews. On one review, they agreed, therefore there was no moderation. They disagreed on the other two reviews and were overruled once, and affirmed once by the moderator.

I believe the code above has me on the right track, but I'm definitely interested in other approaches to this.

like image 795
user94154 Avatar asked May 07 '18 01:05

user94154


3 Answers

I think this is a harder problem than you might realize. The following appends the moderator review to each non-moderator review:

select t.*, tm.result as moderator_result
from tasks t join
    tasks tm
    on t.entity_name = tm.entity_name
where t.reviewer_email NOT LIKE '%@moderators-domain.net' and
      tm.reviewer_email LIKE '%@moderators-domain.net';

From this, we can aggregate the results that you want:

select reviewer_email,
       sum( (result = moderator_result)::int ) as moderator_agrees,
       sum( (result <> moderator_result)::int ) as moderator_disagrees
from (select t.*, tm.result as moderator_result
      from tasks t join
           tasks tm
           on t.entity_name = tm.entity_name
      where t.reviewer_email NOT LIKE '%@moderators-domain.net' and
            tm.reviewer_email LIKE '%@moderators-domain.net'
     ) t
group by reviewer_email;

There may be a way to do this using filter and even window functions. This method seems the most natural to me.

I should note that the subquery is not necessary, of course:

select t.reviewer_email,
       sum( (t.result = tm.result)::int ) as moderator_agrees,
       sum( (t.result <> tm.result)::int ) as moderator_disagrees
from tasks t join
     tasks tm
     on t.entity_name = tm.entity_name
where t.reviewer_email NOT LIKE '%@moderators-domain.net' and
      tm.reviewer_email LIKE '%@moderators-domain.net'
group by t.reviewer_email;
like image 133
Gordon Linoff Avatar answered Sep 28 '22 11:09

Gordon Linoff


Just adding some changes to make the query a bit easier to understand in my opinion. I'm guessing we also need to consider the case where we have users who have never been either affirmed or overruled (so the counts for them would be 0)

SELECT
    tasks.reviewer_email,
    COUNT(*) FILTER (WHERE tasks.result = modtasks.result) AS affirmed_count,
    COUNT(*) FILTER (WHERE tasks.result <> modtasks.result) AS overruled_count
  FROM tasks
  LEFT JOIN tasks modtasks
      ON modtasks.entity_name = tasks.entity_name
      AND modtasks.reviewer_email LIKE '%@moderators-domain.net'
 WHERE tasks.reviewer_email NOT LIKE '%@moderators-domain.net'
 GROUP BY tasks.reviewer_email
like image 36
jmriego Avatar answered Sep 28 '22 11:09

jmriego


Sample data

CREATE TABLE tasks
    ("id" int, "entity_name" text, "reviewer_email" text, "result" boolean)
;

INSERT INTO tasks
    ("id", "entity_name", "reviewer_email", "result")
VALUES
    (1, 'apple', '[email protected]', 'true'),
    (2, 'apple', '[email protected]', 'false'),
    (3, 'apple', 'mod@@moderators-domain.net', 'true'),
    (4, 'pair', '[email protected]', 'true'),
    (5, 'pair', '[email protected]', 'false'),
    (6, 'pair', 'mod@@moderators-domain.net', 'false'),
    (7, 'kiwi', '[email protected]', 'true'),
    (8, 'kiwi', '[email protected]', 'true')
;

Query 1

WITH
CTE_moderated_tasks
AS
(
  SELECT
    id AS mod_id
    ,entity_name AS mod_entity_name
    ,result AS mod_result
  FROM tasks
  WHERE reviewer_email LIKE '%@moderators-domain.net'
)
SELECT
  tasks.reviewer_email
  ,SUM(CASE WHEN tasks.result <> mod_result THEN 1 ELSE 0 END) AS overruled_count 
  ,SUM(CASE WHEN tasks.result = mod_result THEN 1 ELSE 0 END) AS affirmed_count 
FROM
  CTE_moderated_tasks
  INNER JOIN tasks ON 
    tasks.entity_name = CTE_moderated_tasks.mod_entity_name
    AND tasks.id <> CTE_moderated_tasks.mod_id
GROUP BY
  tasks.reviewer_email

I split the query into two parts.

At first I want to find all tasks where moderator was involved (CTE_moderated_tasks). It assumes that moderator can't be involved more than once in the same task.

This result is inner joined to original tasks table thus naturally filtering out all tasks where moderator was not involved. This also gives us moderator opinion next to the reviewer opinion. This assumes that there are only two reviewers for the same task.

All that is left now is simple grouping by reviewers and counting how many times reviewer's and moderator's opinions matched. I used a classic SUM(CASE ...) for this conditional aggregate.

You don't have to use CTE, I used it primarily for readability.

I'd also like to highlight that this query uses LIKE only during one scan of the table. If there is an index on entity_name the join may be rather efficient.

Result

|  reviewer_email | overruled_count | affirmed_count |
|-----------------|-----------------|----------------|
| [email protected] |               1 |              1 |
|   [email protected] |               1 |              1 |

.

Variant without self-join

Here is another variant without self-join, which may be more efficient. You need to test with your real data, indexes and hardware.

This query uses window function with partitioning by entity_name to bring moderator result for each row without explicit self-join. You can use any aggregate function here (SUM or MIN or MAX), because there will be at most one row from moderator for each entity_name.

Then simple grouping with conditional aggregate give us the count.

Here conditional aggregate uses the fact that NULL compared to any value never returns true. mod_result for entities that don't have a moderator would have nulls and both result <> mod_result and result = mod_result would yield false, so such rows don't contribute to either count.

Final HAVING reviewer_email NOT LIKE '%@moderators-domain.net' removes the count of moderator results themselves.

Again, you don't have to use CTE here and I used it primarily for readability. I'd recommend to run just the CTE first and examine intermediate results to understand how the query works.

Query 2

WITH
CTE
AS
(
  SELECT
    id
    ,entity_name
    ,reviewer_email
    ,result::int
    ,SUM(result::int)
      FILTER (WHERE reviewer_email LIKE '%@moderators-domain.net') 
      OVER (PARTITION BY entity_name) AS mod_result
  FROM tasks
)
SELECT
  reviewer_email
  ,SUM(CASE WHEN result <> mod_result THEN 1 ELSE 0 END) AS overruled_count 
  ,SUM(CASE WHEN result = mod_result THEN 1 ELSE 0 END) AS affirmed_count 
FROM CTE
GROUP BY reviewer_email
HAVING reviewer_email NOT LIKE '%@moderators-domain.net'
like image 41
Vladimir Baranov Avatar answered Sep 28 '22 10:09

Vladimir Baranov