 

Improving Postgres performance on graph-like queries of multi-level self-joins (comparison to Neo4j)

One of the claims Neo4j makes in their marketing is that relational databases aren't good at doing multi-level self-join queries:

[Image: excerpt from the Neo4j book comparing relational database and Neo4j execution times for friend-of-friend queries at increasing depths]

I found the code repository corresponding to the book that claim is taken from, and translated it to Postgres:

CREATE TABLE t_user (
  id bigserial PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE t_user_friend (
  id bigserial PRIMARY KEY,
  user_1 bigint NOT NULL REFERENCES t_user,
  user_2 bigint NOT NULL REFERENCES t_user
);

CREATE INDEX idx_user_friend_user_1 ON t_user_friend (user_1);
CREATE INDEX idx_user_friend_user_2 ON t_user_friend (user_2);

/* Create 1M users, each getting a random 10-character name */
INSERT INTO t_user (id, name)
  SELECT x.id, substr(md5(random()::text), 0, 10)
  FROM generate_series(1,1000000) AS x(id);

/* For each user, create 50 random friendships for a total of 50M friendship records */
INSERT INTO t_user_friend (user_1, user_2)
  SELECT g1.x AS user_1, (1 + (random() * 999999)) :: int AS user_2
  FROM generate_series(1, 1000000) as g1(x), generate_series(1, 50) as g2(y);
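
(A side note on the setup: after a bulk load of this size it's worth refreshing planner statistics before running the benchmark queries. This step isn't part of the original repository's script:)

/* Not in the original script: refresh planner statistics after the bulk load */
VACUUM ANALYZE t_user;
VACUUM ANALYZE t_user_friend;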

And these are the queries at various depths Neo4j is comparing against:

/* Depth 2 */

SELECT
  COUNT(DISTINCT f2.user_2) AS cnt 
FROM
  t_user_friend f1 
  INNER JOIN
    t_user_friend f2 
    ON f1.user_2 = f2.user_1 
WHERE
  f1.user_1 = 1;

/* Depth 3 */

SELECT
  COUNT(DISTINCT f3.user_2) AS cnt 
FROM
  t_user_friend f1 
  INNER JOIN
    t_user_friend f2 
    ON f1.user_2 = f2.user_1 
  INNER JOIN
    t_user_friend f3 
    ON f2.user_2 = f3.user_1 
WHERE
  f1.user_1 = 1;

/* Depth 4 */

SELECT
  COUNT(DISTINCT f4.user_2) AS cnt 
FROM
  t_user_friend f1 
  INNER JOIN
    t_user_friend f2 
    ON f1.user_2 = f2.user_1 
  INNER JOIN
    t_user_friend f3 
    ON f2.user_2 = f3.user_1 
  INNER JOIN
    t_user_friend f4 
    ON f3.user_2 = f4.user_1 
WHERE
  f1.user_1 = 1;

/* Depth 5 */

SELECT
  COUNT(DISTINCT f5.user_2) AS cnt 
FROM
  t_user_friend f1 
  INNER JOIN
    t_user_friend f2 
    ON f1.user_2 = f2.user_1 
  INNER JOIN
    t_user_friend f3 
    ON f2.user_2 = f3.user_1 
  INNER JOIN
    t_user_friend f4 
    ON f3.user_2 = f4.user_1 
  INNER JOIN
    t_user_friend f5 
    ON f4.user_2 = f5.user_1 
WHERE
  f1.user_1 = 1;

I was roughly able to reproduce the book's claimed results, getting execution times along these lines against the 1M users and 50M friendships:

| Depth | Count(*) | Time (s) |
|-------|----------|----------|
| 2     | 2497     | 0.067    |
| 3     | 117301   | 0.118    |
| 4     | 997246   | 8.409    |
| 5     | 999999   | 214.56   |

(Here's an EXPLAIN ANALYZE of a depth 5 query)

My question is: is there a way to improve the performance of these queries to meet or exceed Neo4j's execution time of ~2s at depth level 5?

I tried with this recursive CTE:

WITH RECURSIVE chain(user_2, depth) AS (
  SELECT t.user_2, 1 as depth
  FROM t_user_friend t
  WHERE t.user_1 = 1
UNION
  SELECT t.user_2, c.depth + 1 as depth
  FROM t_user_friend t, chain c
  WHERE t.user_1 = c.user_2
  AND depth < 4
)
SELECT COUNT(*)
FROM (SELECT DISTINCT user_2 FROM chain) AS temp;

However, it's still pretty slow, with depth 4 taking 5s and depth 5 taking 48s (EXPLAIN ANALYZE).

Asked Oct 05 '18 by Abe Voelker


1 Answer

I'd like to note from the start that comparing relational and non-relational databases is not a like-for-like comparison.

It is likely that a non-relational database maintains some extra pre-calculated structures as the data is updated. This makes updates somewhat slower and requires more disk space. The pure relational schema that you use doesn't have anything extra, which keeps updates as fast as possible and disk usage to a minimum.

I'll focus on what could be done with the given schema.


First, I'd create a composite index:

CREATE INDEX idx_user_friend_user_12 ON t_user_friend (user_1, user_2);

One such index should be enough.
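
Since a B-tree index on (user_1, user_2) also covers lookups on user_1 alone, the original single-column idx_user_friend_user_1 becomes redundant and could be dropped (assuming nothing else needs it):

/* Optional: the composite index already serves lookups on user_1 alone */
DROP INDEX IF EXISTS idx_user_friend_user_1;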

Then, we know that there are only 1M users in total, so the final result can't be more than 1M rows.

The 5-level query ends up generating 312.5M rows (50*50*50*50*50). This is far more than the maximum possible result, which means there are a lot of duplicates.

So, I'd try to materialize intermediate results and eliminate duplicates early in the process.

We know that Postgres materializes CTEs (it always did before version 12; from version 12 on you may need AS MATERIALIZED to force it), so I'd try to use that.

Something like this:

WITH
CTE12
AS
(
    SELECT
        DISTINCT f2.user_2
    FROM
        t_user_friend f1 
        INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
    WHERE
        f1.user_1 = 1
)
,CTE3
AS
(
    SELECT
        DISTINCT f3.user_2
    FROM
        CTE12
        INNER JOIN t_user_friend f3 ON CTE12.user_2 = f3.user_1
)
,CTE4
AS
(
    SELECT
        DISTINCT f4.user_2
    FROM
        CTE3
        INNER JOIN t_user_friend f4 ON CTE3.user_2 = f4.user_1
)
SELECT
    COUNT(DISTINCT f5.user_2) AS cnt
FROM
    CTE4
    INNER JOIN t_user_friend f5 ON CTE4.user_2 = f5.user_1
;

Most likely the SELECT DISTINCT steps would require sorts, which would allow merge joins to be used.


As far as I can tell from the execution plan for the query above https://explain.depesz.com/s/Sjov , Postgres is not smart enough here and does some unnecessary sorts. It also uses a hash aggregate for some of the SELECT DISTINCT steps, which then requires an extra sort.

So, the next attempt would be to use explicit temporary tables with proper indexes for each step.

Also, I'd define the idx_user_friend_user_12 index as unique. It may provide an extra hint to the optimizer.
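
A sketch of that change; note that the random data generation above can produce duplicate (user_1, user_2) pairs, which would have to be removed before a unique index can be built:

/* Recreate the composite index as UNIQUE; assumes the (user_1, user_2) pairs
   are distinct, so duplicates from the random load must be removed first */
DROP INDEX IF EXISTS idx_user_friend_user_12;
CREATE UNIQUE INDEX idx_user_friend_user_12 ON t_user_friend (user_1, user_2);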

It would be interesting to see how the following performs.

CREATE TABLE temp12
(
    user_2 bigint NOT NULL PRIMARY KEY
);
CREATE TABLE temp3
(
    user_2 bigint NOT NULL PRIMARY KEY
);
CREATE TABLE temp4
(
    user_2 bigint NOT NULL PRIMARY KEY
);

INSERT INTO temp12(user_2)
SELECT
    DISTINCT f2.user_2
FROM
    t_user_friend f1 
    INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
WHERE
    f1.user_1 = 1
;

INSERT INTO temp3(user_2)
SELECT
    DISTINCT f3.user_2
FROM
    temp12
    INNER JOIN t_user_friend f3 ON temp12.user_2 = f3.user_1
;

INSERT INTO temp4(user_2)
SELECT
    DISTINCT f4.user_2
FROM
    temp3
    INNER JOIN t_user_friend f4 ON temp3.user_2 = f4.user_1
;

SELECT
    COUNT(DISTINCT f5.user_2) AS cnt
FROM
    temp4
    INNER JOIN t_user_friend f5 ON temp4.user_2 = f5.user_1
;

DROP TABLE temp12;
DROP TABLE temp3;
DROP TABLE temp4;

As an added bonus of explicit temp tables, you can measure how much time each extra level takes.
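
A related tweak, not measured here: a freshly filled table has no statistics until autovacuum gets to it, so running ANALYZE after each INSERT gives the planner accurate row counts for the next join:

/* Run each ANALYZE right after the corresponding INSERT above,
   before the next level's join */
ANALYZE temp12;
ANALYZE temp3;
ANALYZE temp4;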

Answered Sep 19 '22 by Vladimir Baranov