I'm relatively new to sql and have been trying to get a pretty complex query (for me) to work for a while, but I've kept on getting duplicate values within each column using node postgres. Through this query I am trying to track user info, plan info and email info on a dashboard. Before we get to the query, here are the tables -
USER TABLE (u) - keep track of user info
+----+-------+---------+-------------+----------+
| id | first | last | email | password |
+----+-------+---------+-------------+----------+
| 1 | joe | smith | [email protected] | 1234 |
| 2 | mary | johnson | [email protected] | 3445 |
| 3 | harry | gold | [email protected] | 4345 |
+----+-------+---------+-------------+----------+
PLANS TABLE (p) - plans that users can make with their friends
+----+--------------+-----------+------------+------+--------+-----------+---------+------+
| id | experienceid | hostid(u) | guestid(u) | date | time | paidid(u) | groupid | newp |
+----+--------------+-----------+------------+------+--------+-----------+---------+------+
| 33 | 1 | 1 | [1,2,3] | 4/20 | 8:00pm | [1,2] | 1 | true |
+----+--------------+-----------+------------+------+--------+-----------+---------+------+
EMAIL TABLE (e) - keep track of messages I am sending to the users based on the plan they are a part of
+-------------+-----------+---------+----------+
| email(u) | planid(p) | confirm | reminder |
+-------------+-----------+---------+----------+
| [email protected] | 33 | null | null |
| [email protected] | 33 | true | false |
| [email protected] | 33 | true | false |
+-------------+-----------+---------+----------+
Now for the query what I am trying to do is combine all three tables to get this output -
+-------+---------------+---------------------------+---------+---------+------------+---------------+---------------+
| id(p) | hostname(u+p) | paidguests(u+p) | time(p) | newp(p) | groupid(p) | reminder(e) | confirm(e) |
+-------+---------------+---------------------------+---------+---------+------------+---------------+---------------+
| 33 | joe smith | [joe smith, mary johnson] | 8:00pm | true | 1 | [true, false] | [true, false] |
+-------+---------------+---------------------------+---------+---------+------------+---------------+---------------+
Now where I left off in the query, I almost got it to work, but I kept getting duplicate values where it would look like this -
+-------+---------------+----------------------------------------------------+---------+---------+------------+----------------------------+---------------------------+
| id(p) | hostname(u+p) | paidguests(u+p) | time(p) | newp(p) | groupid(p) | reminder(e) | confirm(e) |
+-------+---------------+----------------------------------------------------+---------+---------+------------+----------------------------+---------------------------+
| 33 | joe smith | [joe smith, mary johnson, joe smith, mary johnson] | 8:00pm | true | 1 | [true, false, true, false] | [true, false, true false] |
+-------+---------------+----------------------------------------------------+---------+---------+------------+----------------------------+---------------------------+
Now I don't really care about order for the confirm and reminder columns relative to the paidguests(u+p) column, as long as the right data is in there and not duplicated. Here is the query I had as it stands -
SELECT p.id,
Concat(u.first, ' ', u.last) AS hostname,
Array_agg(Concat(us.first, ' ', us.last)) AS paidguests,
p.time,
p.groupid,
p.newp,
Array_agg(e.confirm) AS confirm,
Array_agg(e.reminder) AS reminder
FROM plans p
CROSS JOIN Unnest(p.paidid) AS allguests
LEFT JOIN users us
ON allguests = us.id
LEFT JOIN emails e
ON p.id = e.planid
LEFT JOIN users u
ON p.hostid = u.id
WHERE p.experienceid = $1
AND p.date = $2
GROUP BY p.id,
u.first,
u.last,
p.paidid,
p.time,
p.groupid,
p.newp,
confirm,
reminder
ORDER BY Array_length(p.paidid, 1) DESC
So essentially just looking to get the table right without the duplicates. It was working before I added the join to the email table, but not entirely sure why its duplicating.
Hope I was thorough in the explanation. If not, please let me know what I can clarify! Thanks so much :)
try adding this condition in where clause:
AND us.email = e.email
Culprit here is -> all emails are joined with all the users irrespective of their email id since all users and all emails have same plan_id. Hence the duplication.
Rahul spotted the missing join condition. But the rabbit hole goes deeper. I suggest this query:
SELECT p.id
, concat_ws(' ', u.first, u.last) AS hostname -- concat_ws!
, p.time
, p.groupid
, p.newp
, paid.paidguests
, paid.confirm
, paid.reminder
FROM plans p
LEFT JOIN users u ON u.id = p.hostid
LEFT JOIN LATERAL ( -- LATERAL join
SELECT array_agg(sub.paidguest) AS paidguests
, array_agg(sub.confirm) AS confirm
, array_agg(sub.reminder) AS reminder
FROM (
SELECT concat_ws(' ', us.first, us.last) AS paidguest, e.confirm, e.reminder
FROM unnest(p.paidid) WITH ORDINALITY AS paid(id, ord)
JOIN users us ON us.id = paid.id
LEFT JOIN emails e ON e.email = us.email
AND e.planid = p.planid
ORDER BY paid.ord
) sub
) paid ON true
WHERE p.experienceid = $1
AND p.date = $2
-- no GROUP BY needed
ORDER BY cardinality(p.paidid) DESC, p.id;
Assuming that (planid, email) is the PRIMARY KEY of the email table and there is a FOREIGN KEY constraint from email.email to users.email.
Aggregate first, then join, so you don't need to GROUP BY on all the columns that don't need aggregation. While retrieving all or most rows, other query techniques are typically faster, for a small selection like in your example, I suggest a LATERAL join. Related:
In this particular case JOIN LATERAL would do the same as LEFT JOIN LATERAL since a subuery with aggregates and no GROUP BY always returns exactly 1 row.
The alias in your original Unnest(p.paidid) AS allguests is actively confusing, since those seem to be the IDs of guests who paid, not of all guests.
Use concat_ws() if first or last can be null. See:
When unnesting arrays, the order of elements is typically preserved in the simple case. But you have additional joins, so use WITH ORDINALITY and ORDER BY to avoid surprises. Your query may seem to work alright, even for a prolonged period of time - and then "suddenly" break (wrong order of elements) if you are not explicit about this. See:
The whole DB design is debatable. Typically, these arrays should rather be implemented as related tables - for many reasons, beyond the scope of this question.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With