Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the difference between these two Cypher queries?

I'm a bit stumped.

In my database, I have a relationship like this:

(u:User)-[r1:LISTENS_TO]->(a:Artist)<-[r2:LISTENS_TO]-(u2:User)

I want to perform a query where for a given user, I find the common artists between that user and every other user.

To give an idea of size of my database, I have about 600 users, 47,546 artists, and 184,211 relationships between users and artists.

The first query I was trying was the following:

START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")

OPTIONAL MATCH 
    pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
WHERE
    other:User

WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists

ORDER BY mutualArtists DESC
LIMIT 10
RETURN other.username, mutualArtists

This was taking around 20 seconds to return. The profile for this query is as follows:

+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
|             Operator |  Rows | DbHits |            Identifiers |                                                                                          Other |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
|      ColumnFilter(0) |    10 |      0 |                        |                                                     keep columns other.username, mutualArtists |
|              Extract |    10 |     20 |                        |                                                                                 other.username |
|      ColumnFilter(1) |    10 |      0 |                        |                                                              keep columns other, mutualArtists |
|                  Top |    10 |      0 |                        | {  AUTOINT0}; Cached(  INTERNAL_AGGREGATEb6facb18-1c5d-45a6-83bf-a75c25ba6baf of type Integer) |
|     EagerAggregation |   563 |      0 |                        |                                                                                          other |
|        OptionalMatch | 52806 |      0 |                        |                                                                                                |
|             Eager(0) |   563 |      0 |                        |                                                                                                |
|  NodeByIndexQuery(1) |   563 |    564 |           other, other |                             Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
|          NodeById(1) |     1 |      1 |                 me, me |                                                                          Literal(List(553314)) |
|             Eager(1) |    82 |      0 |                        |                                                                                                |
|          ExtractPath |    82 |      0 |         pMutualArtists |                                                                                                |
|            Filter(0) |    82 |     82 |                        |                                                    (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher |    82 |     82 | a, me, ar2, ar1, other |                                                                                                |
|            Filter(1) |     1 |      3 |                        |               ((hasLabel(me:User(3)) AND hasLabel(other:User(3))) AND hasLabel(other:User(3))) |
|  NodeByIndexQuery(1) |   563 |    564 |           other, other |                             Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
|          NodeById(1) |     1 |      1 |                 me, me |                                                                          Literal(List(553314)) |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+

I was frustrated. It didn't seem like this should take 20 seconds.

I came back to the problem later on, and tried debugging it from the start.

I started to break down the query, and I noticed I was getting much faster results. Without the Neo4J Spatial query, I was getting results in about 1.5 seconds.

I finally added things back, and ended up with the following query:

START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")

OPTIONAL MATCH 
    pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
WHERE
    u2:User

WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists

ORDER BY mutualArtists DESC
LIMIT 10
RETURN u2.username, mutualArtists

This query returns in 4240 ms. A 5X improvement! The profile for this query is as follows:

+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
|             Operator |  Rows | DbHits |        Identifiers |                                                                                          Other |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
|      ColumnFilter(0) |    10 |      0 |                    |                                                        keep columns u2.username, mutualArtists |
|              Extract |    10 |     20 |                    |                                                                                    u2.username |
|      ColumnFilter(1) |    10 |      0 |                    |                                                                 keep columns u2, mutualArtists |
|                  Top |    10 |      0 |                    | {  AUTOINT0}; Cached(  INTERNAL_AGGREGATEbdf86ac1-8677-4d45-967f-c2dd594aba49 of type Integer) |
|     EagerAggregation |   563 |      0 |                    |                                                                                             u2 |
|        OptionalMatch | 52806 |      0 |                    |                                                                                                |
|             Eager(0) |   563 |      0 |                    |                                                                                                |
|  NodeByIndexQuery(1) |   563 |    564 |             u2, u2 |                             Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
|          NodeById(1) |     1 |      1 |               u, u |                                                                          Literal(List(553314)) |
|             Eager(1) |    82 |      0 |                    |                                                                                                |
|          ExtractPath |    82 |      0 |     pMutualArtists |                                                                                                |
|            Filter(0) |    82 |     82 |                    |                                                    (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher |    82 |     82 | a, u2, u, ar2, ar1 |                                                                                                |
|            Filter(1) |     1 |      3 |                    |                      ((hasLabel(u:User(3)) AND hasLabel(u2:User(3))) AND hasLabel(u2:User(3))) |
|  NodeByIndexQuery(1) |   563 |    564 |             u2, u2 |                             Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
|          NodeById(1) |     1 |      1 |               u, u |                                                                          Literal(List(553314)) |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+

And, to prove that I ran them both in a row and got very different results:

neo4j-sh (?)$ START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
>     pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
> WHERE
>     u2:User
>
> WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN u2.username, mutualArtists
> ;
+------------------------------+
| u2.username  | mutualArtists |
+------------------------------+
| "573904765"  | 644           |
| "28600291"   | 601           |
| "1092510304" | 558           |
| "1367963461" | 521           |
| "1508790199" | 455           |
| "1335360028" | 447           |
| "18200866"   | 444           |
| "1229430376" | 435           |
| "748318333"  | 434           |
| "5612902"    | 431           |
+------------------------------+
10 rows
4240 ms
neo4j-sh (?)$ START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
>     pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
> WHERE
>     other:User
>
> WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN other.username, mutualArtists;
+--------------------------------+
| other.username | mutualArtists |
+--------------------------------+
| "573904765"    | 644           |
| "28600291"     | 601           |
| "1092510304"   | 558           |
| "1367963461"   | 521           |
| "1508790199"   | 455           |
| "1335360028"   | 447           |
| "18200866"     | 444           |
| "1229430376"   | 435           |
| "748318333"    | 434           |
| "5612902"      | 431           |
+--------------------------------+
10 rows
20418 ms

Unless I have gone crazy, the only difference between these two queries is the names of the nodes (I've changed "me" to "u" and "other" to "u2").

Why does that cause a 5X improvement??!?!

If anyone has any insight into this, I would be eternally grateful.

Thanks,

-Adam


EDIT 8.1.14

Based on @ulkas's suggestion, I tried simplifying the query.

The results were:

START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
RETURN u2.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10

~4 seconds

START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
RETURN other.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10

~20 seconds

So bizarre. It seems as though literally the named nodes of "other" and "me" cause the query time to jump tremendously. I'm very confused.

Thanks, -Adam

like image 798
Adam Miskiewicz Avatar asked Nov 01 '22 19:11

Adam Miskiewicz


2 Answers

That sounds like you're seeing the effect of caching. Upon the first access the cache is not populated. Subsequent queries hitting the same graph will be much faster since the nodes/relationships are already available in the cache.

like image 88
Stefan Armbruster Avatar answered Dec 30 '22 11:12

Stefan Armbruster


working with OPTIONAL MATCH following WHERE other:User has no sense, since the end node other (u2) must be match. try to perform the queries without optional match and where and without the last with, simply

START me=node(553314), other=node:userLocations("withinDistance[38.89037,-77.03196,80.467]")
 MATCH
     pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)   
 RETURN other.username, count(DISTINCT pMutualArtists) as mutualArtists
 ORDER BY mutualArtists DESC
 LIMIT 10
like image 42
ulkas Avatar answered Dec 30 '22 10:12

ulkas