I have a table in Postgres called tasks
. It records Mechanical Turk-style tasks. It has the following columns:
entity_name, text (the thing being reviewed)
reviewer_email, text (the email address of the person doing the reviewing)
result, boolean (the entry provided by the reviewer)
Each entity that needs to be reviewed leads to the generation of two task rows, each assigned to a different reviewer. When both reviewers disagree (e.g. their values for result
are not equal), the application kicks off a third task, assigned to a moderator. The moderators always have the same email domain.
I'm trying to get the counts for each time reviewer a reviewer has been overruled by a moderator, or affirmed by a moderator. I think I'm fairly close, but the last bit is proving tricky:
SELECT
reviewer_email,
COUNT(*) FILTER(
WHERE entity_name IN (
SELECT entity_name
FROM tasks
GROUP BY entity_name
HAVING
COUNT(*) FILTER (WHERE result IS NOT NULL) = 3 -- find the entities that have exactly three reviews
AND
-- this is the tricky part:
-- need something like:
-- WHERE current_review.result = moderator_review.result
)
) AS overruled_count
FROM
tasks
WHERE
result IS NOT NULL
GROUP BY
reviewer_email
HAVING
reviewer_email NOT LIKE '%@moderators-domain.net'
Sample data:
id | entity_name | reviewer_email | result
1 | apple | [email protected] | true
2 | apple | [email protected] | false
3 | apple | mod@@moderators-domain.net | true
4 | pair | [email protected] | true
5 | pair | [email protected] | false
6 | pair | mod@@moderators-domain.net | false
7 | kiwi | [email protected] | true
8 | kiwi | [email protected] | true
Desired results:
reviewer_email | overruled_count | affirmed_count
[email protected] | 1 | 1
[email protected] | 1 | 1
Bob and Alice each have done three reviews. On one review, they agreed, therefore there was no moderation. They disagreed on the other two reviews and were overruled once, and affirmed once by the moderator.
I believe the code above has me on the right track, but I'm definitely interested in other approaches to this.
I think this is a harder problem than you might realize. The following appends the moderator review to each non-moderator review:
select t.*, tm.result as moderator_result
from tasks t join
tasks tm
on t.entity_name = tm.entity_name
where t.reviewer_email NOT LIKE '%@moderators-domain.net' and
tm.reviewer_email LIKE '%@moderators-domain.net';
From this, we can aggregate the results that you want:
select reviewer_email,
sum( (result = moderator_result)::int ) as moderator_agrees,
sum( (result <> moderator_result)::int ) as moderator_disagrees
from (select t.*, tm.result as moderator_result
from tasks t join
tasks tm
on t.entity_name = tm.entity_name
where t.reviewer_email NOT LIKE '%@moderators-domain.net' and
tm.reviewer_email LIKE '%@moderators-domain.net'
) t
group by reviewer_email;
There may be a way to do this using filter
and even window functions. This method seems the most natural to me.
I should note that the subquery is not necessary, of course:
select t.reviewer_email,
sum( (t.result = tm.result)::int ) as moderator_agrees,
sum( (t.result <> tm.result)::int ) as moderator_disagrees
from tasks t join
tasks tm
on t.entity_name = tm.entity_name
where t.reviewer_email NOT LIKE '%@moderators-domain.net' and
tm.reviewer_email LIKE '%@moderators-domain.net'
group by t.reviewer_email;
Just adding some changes to make the query a bit easier to understand in my opinion. I'm guessing we also need to consider the case where we have users who have never been either affirmed or overruled (so the counts for them would be 0)
SELECT
tasks.reviewer_email,
COUNT(*) FILTER (WHERE tasks.result = modtasks.result) AS affirmed_count,
COUNT(*) FILTER (WHERE tasks.result <> modtasks.result) AS overruled_count
FROM tasks
LEFT JOIN tasks modtasks
ON modtasks.entity_name = tasks.entity_name
AND modtasks.reviewer_email LIKE '%@moderators-domain.net'
WHERE tasks.reviewer_email NOT LIKE '%@moderators-domain.net'
GROUP BY tasks.reviewer_email
Sample data
CREATE TABLE tasks
("id" int, "entity_name" text, "reviewer_email" text, "result" boolean)
;
INSERT INTO tasks
("id", "entity_name", "reviewer_email", "result")
VALUES
(1, 'apple', '[email protected]', 'true'),
(2, 'apple', '[email protected]', 'false'),
(3, 'apple', 'mod@@moderators-domain.net', 'true'),
(4, 'pair', '[email protected]', 'true'),
(5, 'pair', '[email protected]', 'false'),
(6, 'pair', 'mod@@moderators-domain.net', 'false'),
(7, 'kiwi', '[email protected]', 'true'),
(8, 'kiwi', '[email protected]', 'true')
;
Query 1
WITH
CTE_moderated_tasks
AS
(
SELECT
id AS mod_id
,entity_name AS mod_entity_name
,result AS mod_result
FROM tasks
WHERE reviewer_email LIKE '%@moderators-domain.net'
)
SELECT
tasks.reviewer_email
,SUM(CASE WHEN tasks.result <> mod_result THEN 1 ELSE 0 END) AS overruled_count
,SUM(CASE WHEN tasks.result = mod_result THEN 1 ELSE 0 END) AS affirmed_count
FROM
CTE_moderated_tasks
INNER JOIN tasks ON
tasks.entity_name = CTE_moderated_tasks.mod_entity_name
AND tasks.id <> CTE_moderated_tasks.mod_id
GROUP BY
tasks.reviewer_email
I split the query into two parts.
At first I want to find all tasks where moderator was involved (CTE_moderated_tasks
). It assumes that moderator can't be involved more than once in the same task.
This result is inner joined to original tasks
table thus naturally filtering out all tasks where moderator was not involved. This also gives us moderator opinion next to the reviewer opinion. This assumes that there are only two reviewers for the same task.
All that is left now is simple grouping by reviewers and counting how many times reviewer's and moderator's opinions matched. I used a classic SUM(CASE ...)
for this conditional aggregate.
You don't have to use CTE, I used it primarily for readability.
I'd also like to highlight that this query uses LIKE
only during one scan of the table. If there is an index on entity_name
the join may be rather efficient.
Result
| reviewer_email | overruled_count | affirmed_count |
|-----------------|-----------------|----------------|
| [email protected] | 1 | 1 |
| [email protected] | 1 | 1 |
.
Here is another variant without self-join, which may be more efficient. You need to test with your real data, indexes and hardware.
This query uses window function with partitioning by entity_name
to bring moderator result for each row without explicit self-join. You can use any aggregate function here (SUM
or MIN
or MAX
), because there will be at most one row from moderator for each entity_name
.
Then simple grouping with conditional aggregate give us the count.
Here conditional aggregate uses the fact that NULL
compared to any value never returns true. mod_result
for entities that don't have a moderator would have nulls and both result <> mod_result
and result = mod_result
would yield false, so such rows don't contribute to either count.
Final HAVING reviewer_email NOT LIKE '%@moderators-domain.net'
removes the count of moderator results themselves.
Again, you don't have to use CTE here and I used it primarily for readability. I'd recommend to run just the CTE first and examine intermediate results to understand how the query works.
Query 2
WITH
CTE
AS
(
SELECT
id
,entity_name
,reviewer_email
,result::int
,SUM(result::int)
FILTER (WHERE reviewer_email LIKE '%@moderators-domain.net')
OVER (PARTITION BY entity_name) AS mod_result
FROM tasks
)
SELECT
reviewer_email
,SUM(CASE WHEN result <> mod_result THEN 1 ELSE 0 END) AS overruled_count
,SUM(CASE WHEN result = mod_result THEN 1 ELSE 0 END) AS affirmed_count
FROM CTE
GROUP BY reviewer_email
HAVING reviewer_email NOT LIKE '%@moderators-domain.net'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With