One of the claims Neo4j makes in their marketing is that relational databases aren't good at doing multi-level self-join queries:
I found the code repository corresponding to the book that claim is taken from, and translated it to Postgres:
CREATE TABLE t_user (
id bigserial PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE t_user_friend (
id bigserial PRIMARY KEY,
user_1 bigint NOT NULL REFERENCES t_user,
user_2 bigint NOT NULL REFERENCES t_user
);
CREATE INDEX idx_user_friend_user_1 ON t_user_friend (user_1);
CREATE INDEX idx_user_friend_user_2 ON t_user_friend (user_2);
/* Create 1M users, each getting a random 10-character name */
INSERT INTO t_user (id, name)
SELECT x.id, substr(md5(random()::text), 0, 10)
FROM generate_series(1,1000000) AS x(id);
/* For each user, create 50 random friendships for a total of 50M friendship records */
INSERT INTO t_user_friend (user_1, user_2)
SELECT g1.x AS user_1, (1 + (random() * 999999)) :: int AS user_2
FROM generate_series(1, 1000000) as g1(x), generate_series(1, 50) as g2(y);
And these are the queries at various depths Neo4j is comparing against:
/* Depth 2 */
SELECT
COUNT(DISTINCT f2.user_2) AS cnt
FROM
t_user_friend f1
INNER JOIN
t_user_friend f2
ON f1.user_2 = f2.user_1
WHERE
f1.user_1 = 1;
/* Depth 3 */
SELECT
COUNT(DISTINCT f3.user_2) AS cnt
FROM
t_user_friend f1
INNER JOIN
t_user_friend f2
ON f1.user_2 = f2.user_1
INNER JOIN
t_user_friend f3
ON f2.user_2 = f3.user_1
WHERE
f1.user_1 = 1;
/* Depth 4 */
SELECT
COUNT(DISTINCT f4.user_2) AS cnt
FROM
t_user_friend f1
INNER JOIN
t_user_friend f2
ON f1.user_2 = f2.user_1
INNER JOIN
t_user_friend f3
ON f2.user_2 = f3.user_1
INNER JOIN
t_user_friend f4
ON f3.user_2 = f4.user_1
WHERE
f1.user_1 = 1;
/* Depth 5 */
SELECT
COUNT(DISTINCT f5.user_2) AS cnt
FROM
t_user_friend f1
INNER JOIN
t_user_friend f2
ON f1.user_2 = f2.user_1
INNER JOIN
t_user_friend f3
ON f2.user_2 = f3.user_1
INNER JOIN
t_user_friend f4
ON f3.user_2 = f4.user_1
INNER JOIN
t_user_friend f5
ON f4.user_2 = f5.user_1
WHERE
f1.user_1 = 1;
I was roughly able to reproduce the book's claimed results, getting these sorts of execution times against the 1M users, 50M friendships:
| Depth | Count(*) | Time (s) |
|-------|----------|----------|
| 2 | 2497 | 0.067 |
| 3 | 117301 | 0.118 |
| 4 | 997246 | 8.409 |
| 5 | 999999 | 214.56 |
(Here's an EXPLAIN ANALYZE of a depth 5 query)
My question is, is there a way to improve the performance of these queries to meet or exceed Neo4j's execution time of ~2s at depth level 5?
I tried with this recursive CTE:
WITH RECURSIVE chain(user_2, depth) AS (
SELECT t.user_2, 1 as depth
FROM t_user_friend t
WHERE t.user_1 = 1
UNION
SELECT t.user_2, c.depth + 1 as depth
FROM t_user_friend t, chain c
WHERE t.user_1 = c.user_2
AND depth < 4
)
SELECT COUNT(*)
FROM (SELECT DISTINCT user_2 FROM chain) AS temp;
However it's still pretty slow, with a depth 4 taking 5s and a depth 5 taking 48s (EXPLAIN ANALYZE)
I'd like to note from the start that comparing relational and non-relation databases are not like-for-like comparison.
It is likely that non-relation database maintains some extra pre-calculated structures as the data is updated. This makes updates somewhat slower and requires more disk space. Pure relational schema that you use don't have anything extra, which makes updates as fast as possible and keeps disk usage to the minimum.
I'll focus on what could be done with the given schema.
At first I'd make a composite index
CREATE INDEX idx_user_friend_user_12 ON t_user_friend (user_1, user_2);
One such index should be enough.
Then, we know that there are only 1M users in total, so final result can't be more than 1M.
The 5-level query ends up generating 312.5M rows (50*50*50*50*50). This is way more than maximum possible result, which means that there are a lot of duplicates.
So, I'd try to materialize intermediate results and eliminate duplicates early in the process.
We know that Postgres materializes CTEs, so I'd try to use that.
Something like this:
WITH
CTE12
AS
(
SELECT
DISTINCT f2.user_2
FROM
t_user_friend f1
INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
WHERE
f1.user_1 = 1
)
,CTE3
AS
(
SELECT
DISTINCT f3.user_2
FROM
CTE12
INNER JOIN t_user_friend f3 ON CTE12.user_2 = f3.user_1
)
,CTE4
AS
(
SELECT
DISTINCT f4.user_2
FROM
CTE3
INNER JOIN t_user_friend f4 ON CTE3.user_2 = f4.user_1
)
SELECT
COUNT(DISTINCT f5.user_2) AS cnt
FROM
CTE4
INNER JOIN t_user_friend f5 ON CTE4.user_2 = f5.user_1
;
Most likely SELECT DISTINCT
would require sorts, which would allow to use merge joins.
As far as I could understand from the execution plan for the query above https://explain.depesz.com/s/Sjov , Postgres is not smart enough and does some unnecessary sorts. Also, it uses hash aggregate for some SELECT DISTINCT
, which requires extra sort.
So, the next attempt would be to use temporary tables with proper indexes for each step explicitly.
Also, I'd define the idx_user_friend_user_12
index as unique. It may provide an extra hint to optimizer.
It would be interesting to see how the following performs.
CREATE TABLE temp12
(
user_2 bigint NOT NULL PRIMARY KEY
);
CREATE TABLE temp3
(
user_2 bigint NOT NULL PRIMARY KEY
);
CREATE TABLE temp4
(
user_2 bigint NOT NULL PRIMARY KEY
);
INSERT INTO temp12(user_2)
SELECT
DISTINCT f2.user_2
FROM
t_user_friend f1
INNER JOIN t_user_friend f2 ON f1.user_2 = f2.user_1
WHERE
f1.user_1 = 1
;
INSERT INTO temp3(user_2)
SELECT
DISTINCT f3.user_2
FROM
temp12
INNER JOIN t_user_friend f3 ON temp12.user_2 = f3.user_1
;
INSERT INTO temp4(user_2)
SELECT
DISTINCT f4.user_2
FROM
temp3
INNER JOIN t_user_friend f4 ON temp3.user_2 = f4.user_1
;
SELECT
COUNT(DISTINCT f5.user_2) AS cnt
FROM
temp4
INNER JOIN t_user_friend f5 ON temp4.user_2 = f5.user_1
;
DROP TABLE temp12;
DROP TABLE temp3;
DROP TABLE temp4;
As an added bonus of explicit temp tables you can measure how much time each extra level takes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With