Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to simply and efficiently query for nested relationships in SQL?

Tags:

sql

postgresql

I'm looking to write the simplest, most efficient SQL query to retrieve all of the events that are related to a given user.


Setup

Here's a simplistic representation of what my schema looks like:

enter image description here

A couple things to note:

  • users belong to teams via memberships.
  • teams can have many collections, apps, and webhooks.
  • collections can also have many webhooks.
  • webhooks can belong to either a team or collection, but only one.
  • events can belong to any object, but only one.

This seems like a fairly basic setup that most SaaS-type companies would have (eg. Slack or Stripe). Everything is "owned" by the teams, but users belong to teams and interact with the interface.


Problem

Given that setup, I'd like to create a SQL query that solves...

Find all of the events that are related (directly or indirectly) to a given user by id.

I can easily write queries that find either directly, or indirectly via a specific means. For example...

Find all of the events that are directly related to a user by id.

SELECT *
FROM events
WHERE user_id = ${id}

Or...

Find all of the events that are indirectly related to a user via their teams.

SELECT events.*
FROM events
JOIN memberships ON memberships.team_id = events.team_id
WHERE memberships.user_id = ${id}

Or even...

Find all of the events that are indirectly related to a user via any collections of their teams.

SELECT events.*
FROM events
JOIN collections ON collections.id = events.collection_id
JOIN memberships ON memberships.team_id = collections.team_id
WHERE memberships.user_id = ${id}

Webhooks get a more complex, because they can be related in two different ways...

Find all of the events that are indirectly related to a user via any webhooks of their teams or collections.

SELECT *
FROM events
WHERE webhook_id IN (
  SELECT webhooks.id
  FROM webhooks
  JOIN memberships ON memberships.team_id = webhooks.team_id
  WHERE memberships.user_id = ${id}
)
OR webhook_id IN (
  SELECT webhooks.id
  FROM webhooks
  JOIN collections ON collections.id = webhooks.collection_id
  JOIN memberships ON memberships.team_id = collections.team_id
  WHERE memberships.user_id = ${id}
)

But as you can see, there are a lot of different ways for a user to be related to an event that occured, via all those paths! So when I try to a query that successfully gets all of those related events, it ends up looking like...

SELECT * 
FROM events
WHERE user_id = ${id}
OR app_id IN (
  SELECT apps.id
  FROM apps
  JOIN memberships ON memberships.team_id = apps.team_id
  WHERE memberships.user_id = ${id}
)
OR collection_id IN (
  SELECT collections.id
  FROM collections
  JOIN memberships ON memberships.team_id = collections.team_id
  WHERE memberships.user_id = ${id}
)
OR memberships_id IN (
  SELECT id
  FROM memberships
  WHERE user_id = ${id}
)
OR team_id IN (
  SELECT team_id
  FROM memberships
  WHERE user_id = ${id}
)
OR webhook_id IN (
  SELECT webhooks.id
  FROM webhooks
  JOIN memberships ON memberships.team_id = webhooks.team_id
  WHERE memberships.user_id = ${id}
)
OR webhook_id IN (
  SELECT webhooks.id
  FROM webhooks
  JOIN collections ON collections.id = webhooks.collection_id
  JOIN memberships ON memberships.team_id = collections.team_id
  WHERE memberships.user_id = ${id}
)

Questions

  • Is that final "all included" query very inefficient?
  • Is there a more efficient way to write it?
  • Is there a simpler, easier-to-read-later way to write it?
like image 526
Ian Storm Taylor Avatar asked Apr 06 '17 23:04

Ian Storm Taylor


3 Answers

As with any query, the most efficient method is "it depends". There's many variables in play - the number of rows in tables, the row lengths, whether indices exist, the RAM on the server, etc etc.

The best way I can think of handling this sort of problem (thinking maintainability and a braod approach to efficiency) is by using CTEs, which allows you to create a temporary result and reuse that result throughout your query. CTEs use the WITH keyword, and essentially alias a result as a table, so that you can JOIN against it multiple times:

WITH user_memberships AS (
    SELECT *
    FROM memberships
    WHERE user_id = ${id}
), user_apps AS (
    SELECT *
    FROM apps
    INNER JOIN user_memberships
        ON user_memberships.team_id = apps.team_id
), user_collections AS (
    SELECT *
    FROM collections
    INNER JOIN user_memberships
        ON user_memberships.team_id = collections.team_id
), user_webhooks AS (
    SELECT *
    FROM webhooks
    LEFT OUTER JOIN user_collections ON user_collections.id = webhooks.collection_id
    INNER JOIN user_memberships
        ON user_memberships.team_id = webhooks.team_id
        OR user_memberships.team_id = user_collections.team_id
)

SELECT events.* 
FROM events
WHERE app_id IN (SELECT id FROM user_apps)
OR collection_id IN (SELECT id FROM user_collections)
OR membership_id IN (SELECT id FROM user_memberships)
OR team_id IN (SELECT team_id FROM user_memberships)
OR user_id = ${id}
OR webhook_id IN (SELECT id FROM user_webhooks)
;

The benefits of doing it this way are:

  1. Each CTE can take advantage of an index on the appropriate JOIN predicates and return results for just that subset faster, rather than have the execution planner attempt to resolve a series of complex predicates
  2. The CTEs can be individually maintained, making troubleshooting problems with subsets easier
  3. You're not violating the DRY principle
  4. If the CTE has value outside of the query, you can move it into a stored procedure and reference that instead
like image 106
e_i_pi Avatar answered Oct 20 '22 21:10

e_i_pi


The only thing I can think of that would make it faster is to use unions.

SELECT e.* 
FROM events e
WHERE user_id = ${id}
UNION 
select e.*
  FROM apps a
  join events e on a.apps_id = e.apps_id
  JOIN memberships ON memberships.team_id = apps.team_id
  WHERE memberships.user_id = ${id}
UNION
select e.*
from 
  FROM collections c 
  join events e on e.collections_id = c.collections_id
  JOIN memberships ON memberships.team_id = collections.team_id
  WHERE memberships.user_id = ${id}
UNION
select e.*
  FROM memberships m
  join events e on e.memberships_id = e.memberships_id
  WHERE user_id = ${id}
UNION
...;
like image 33
Uncle Iroh Avatar answered Oct 20 '22 21:10

Uncle Iroh


I don't know how much control you have over your schema. If the answer is "none" then read no further. I'm not going to put much detail down here in case it's not appropriate to your situation but it looks like an ownership model to me.

i.e.

BaseTable

Id

IdOwner (FK to Id on BaseTable - very important)

Type (User = 0, App = 1, Collection = 2 etc. or use an enumeration)

App

Id (FK to BaseTable)

Collection

Id (FK to BaseTable)

Membership

Id (FK to BaseTable)

Webhooks

Id (FK to BaseTable)

Team

Id (FK to BaseTable)

Events

Id (FK to BaseTable)

Memberships

Team_Id (FK to Basetable or Team)

User_Id (FK to Basetable or Users)

Users

Id (FK to BaseTable)

Then your query becomes a recursive CTE: "Find me all objects of type Event owned by - or eventually owned by user x"

That would give you a list of id's which you'd then have to join to your Events table and you have your objects.

This sort of model does get a bit hairy because to load anything you have to join it with the base table but for this sort of nested ownership it works quite nicely.

I'd like to post this as a comment but if I do the formatting will disappear so I've posted it as an answer. If it helps and you want a bit more detail feel free to get back to me.

If I've completely missed the point and this doesn't help please don't shout at me (had that before on SO) just say "thanks, Adam, but that doesn't help" and I'll delete it.

Kind regards,

Adam.

like image 3
Adam Benson Avatar answered Oct 20 '22 20:10

Adam Benson