I'm evaluating Neo4j for our application, and now am at a point where performance is an issue. I've created a lot of nodes and edges that I'm doing some queries against. The following is a detail of the nodes and edges data in this database:

I am trying to do a search that traverses the yellow arrows of this diagram. What I have so far is the following query:
MATCH (n:LABEL_TYPE_Project {id:'14'})
-[:RELATIONSHIP_scopes*1]->(m:LABEL_TYPE_PermissionNode)
-[:RELATIONSHIP_observedBy*1]->(o:LABEL_TYPE_Item)
WHERE m.id IN ['1', '2', '6', '12', '12064', '19614', '19742', '19863', '21453', '21454', '21457', '21657', '21658', '31123', '31127', '31130', '47691', '55603', '55650', '56026', '56028', '56029', '56050', '56052', '85383', '85406', '85615', '105665', '1035242', '1035243']
AND o.content =~ '.*some string.*'
RETURN o
LIMIT 20
(The variable paths above have been updated, see "Update 2")
The above query takes a barely-acceptable 1200ms. It only returns the requested 20 items. If I want a count of the same, this takes forever:
MATCH ... more of the same ...
RETURN count(o)
The above query takes many minutes. This is Neo4j 2.2.0-M03 Community running on CentOS. There is around 385,000 nodes, 170,000 of type Item.
I have created indices on all id fields (programmatically, index().forNodes(...).add(...)), also on the content field (CREATE INDEX ... statement).
Are there fundamental improvements yet to be made to my queries? Things I can try?
Much appreciated.
This question was moved over from Neo4j discussion group on Google per their suggestions.
As requested:
:schema
Gives:
Indexes
ON :LABEL_TYPE_Item(id) ONLINE
ON :LABEL_TYPE_Item(active) ONLINE
ON :LABEL_TYPE_Item(content) ONLINE
ON :LABEL_TYPE_PermissionNode(id) ONLINE
ON :LABEL_TYPE_Project(id) ONLINE
No constraints
(This is updated, see "Update 2")
I have made the following noteworthy improvements to the query:
TYPE_Projects (not by design, just messed up the importing algorithm that I was using) and I removed it nowAs a consequence:
:schema
Now gives:
Indexes
ON :LABEL_TYPE_Item(active) ONLINE
ON :LABEL_TYPE_Item(content) ONLINE
ON :LABEL_TYPE_Item(id) ONLINE (for uniqueness constraint)
ON :LABEL_TYPE_PermissionNode(id) ONLINE (for uniqueness constraint)
ON :LABEL_TYPE_Project(id) ONLINE (for uniqueness constraint)
Constraints
ON (label_type_item:LABEL_TYPE_Item) ASSERT label_type_item.id IS UNIQUE
ON (label_type_project:LABEL_TYPE_Project) ASSERT label_type_project.id IS UNIQUE
ON (label_type_permissionnode:LABEL_TYPE_PermissionNode) ASSERT label_type_permissionnode.id IS UNIQUE
The query now looks like this:
MATCH (n:LABEL_TYPE_Project {id:14})
-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)
-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id IN [1, 2, 6, 12, 12064, 19614, 19742, 19863, 21453, 21454, 21457, 21657, 21658, 31123, 31127, 31130, 47691, 55603, 55650, 56026, 56028, 56029, 56050, 56052, 85383, 85406, 85615, 105665, 1035242, 1035243]
AND o.content =~ '.*some string.*'
RETURN o
LIMIT 20
The above query now takes approx. 350ms.
I still want a count of the same:
MATCH ...
RETURN count(0)
The above query now takes approx. 1100ms. Although that's much better, and barely acceptable for this particular query, I've already found some more-complex queries that inherently take longer. So a further improvement on this query here would be great.
As requested here is the PROFILE for RETURN o query (for the improved query):
Compiler CYPHER 2.2
Planner COST
Projection
|
+Limit
|
+Filter(0)
|
+Expand(All)(0)
|
+Filter(1)
|
+Expand(All)(1)
|
+NodeUniqueIndexSeek
+---------------------+---------------+-------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------------+---------------+-------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection | 1900 | 20 | 0 | m, n, o | o |
| Limit | 1900 | 20 | 0 | m, n, o | { AUTOINT32} |
| Filter(0) | 1900 | 20 | 131925 | m, n, o | (hasLabel(o:LABEL_TYPE_Item) AND Property(o,content(23)) ~= /{ AUTOSTRING31}/) |
| Expand(All)(0) | 4993 | 43975 | 43993 | m, n, o | (m)-[:RELATIONSHIP_observedBy]->(o) |
| Filter(1) | 2 | 18 | 614 | m, n | (hasLabel(m:LABEL_TYPE_PermissionNode) AND any(-_-INNER-_- in Collection(List({ AUTOINT1}, { AUTOINT2}, { AUTOINT3}, { AUTOINT4}, { AUTOINT5}, { AUTOINT6}, { AUTOINT7}, { AUTOINT8}, { AUTOINT9}, { AUTOINT10}, { AUTOINT11}, { AUTOINT12}, { AUTOINT13}, { AUTOINT14}, { AUTOINT15}, { AUTOINT16}, { AUTOINT17}, { AUTOINT18}, { AUTOINT19}, { AUTOINT20}, { AUTOINT21}, { AUTOINT22}, { AUTOINT23}, { AUTOINT24}, { AUTOINT25}, { AUTOINT26}, { AUTOINT27}, { AUTOINT28}, { AUTOINT29}, { AUTOINT30})) where Property(m,id(0)) == -_-INNER-_-)) |
| Expand(All)(1) | 11 | 18 | 19 | m, n | (n)-[:RELATIONSHIP_scopes]->(m) |
| NodeUniqueIndexSeek | 1 | 1 | 1 | n | :LABEL_TYPE_Project(id) |
+---------------------+---------------+-------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
And here is the PROFILE for RETURN count(o) query (for the improved query):
Compiler CYPHER 2.2
Planner COST
Limit
|
+EagerAggregation
|
+Filter(0)
|
+Expand(All)(0)
|
+Filter(1)
|
+Expand(All)(1)
|
+NodeUniqueIndexSeek
+---------------------+---------------+--------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------------+---------------+--------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit | 44 | 1 | 0 | count(o) | { AUTOINT32} |
| EagerAggregation | 44 | 1 | 0 | count(o) | |
| Filter(0) | 1900 | 101 | 440565 | m, n, o | (hasLabel(o:LABEL_TYPE_Item) AND Property(o,content(23)) ~= /{ AUTOSTRING31}/) |
| Expand(All)(0) | 4993 | 146855 | 146881 | m, n, o | (m)-[:RELATIONSHIP_observedBy]->(o) |
| Filter(1) | 2 | 26 | 850 | m, n | (hasLabel(m:LABEL_TYPE_PermissionNode) AND any(-_-INNER-_- in Collection(List({ AUTOINT1}, { AUTOINT2}, { AUTOINT3}, { AUTOINT4}, { AUTOINT5}, { AUTOINT6}, { AUTOINT7}, { AUTOINT8}, { AUTOINT9}, { AUTOINT10}, { AUTOINT11}, { AUTOINT12}, { AUTOINT13}, { AUTOINT14}, { AUTOINT15}, { AUTOINT16}, { AUTOINT17}, { AUTOINT18}, { AUTOINT19}, { AUTOINT20}, { AUTOINT21}, { AUTOINT22}, { AUTOINT23}, { AUTOINT24}, { AUTOINT25}, { AUTOINT26}, { AUTOINT27}, { AUTOINT28}, { AUTOINT29}, { AUTOINT30})) where Property(m,id(0)) == -_-INNER-_-)) |
| Expand(All)(1) | 11 | 26 | 27 | m, n | (n)-[:RELATIONSHIP_scopes]->(m) |
| NodeUniqueIndexSeek | 1 | 1 | 1 | n | :LABEL_TYPE_Project(id) |
+---------------------+---------------+--------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Remaining suggestions:
MATCH ... WITH x MATCH ...->(x) syntax: this did not help me at all, so farI've been playing with full-text search, and indexed the content property as follows:
IndexManager indexManager = getGraphDb().index();
Map<String, String> customConfiguration = MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "type", "fulltext");
Index<Node> index = indexManager.forNodes("INDEX_FULL_TEXT_content_Item", customConfiguration);
index.add(node, "content", value);
When I run the following query this takes approx. 1200ms:
START o=node:INDEX_FULL_TEXT_content_Item("content:*some string*")
MATCH (n:LABEL_TYPE_Project {id:14})
-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)
-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id IN [1, 2, 6, 12, 12064, 19614, 19742, 19863, 21453, 21454, 21457, 21657, 21658, 31123, 31127, 31130, 47691, 55603, 55650, 56026, 56028, 56029, 56050, 56052, 85383, 85406, 85615, 105665, 1035242, 1035243]
RETURN count(o);
Here is the PROFILE for this query:
Compiler CYPHER 2.2
Planner COST
EagerAggregation
|
+Filter(0)
|
+Expand(All)(0)
|
+NodeHashJoin
|
+Filter(1)
| |
| +NodeByIndexQuery
|
+Expand(All)(1)
|
+NodeUniqueIndexSeek
+---------------------+---------------+--------+--------+-------------+------------------------------------------------------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------------+---------------+--------+--------+-------------+------------------------------------------------------------------------+
| EagerAggregation | 50 | 1 | 0 | count(o) | |
| Filter(0) | 2533 | 166 | 498 | m, n, o | (Property(n,id(0)) == { AUTOINT0} AND hasLabel(n:LABEL_TYPE_Project)) |
| Expand(All)(0) | 32933 | 166 | 332 | m, n, o | (m)<-[:RELATIONSHIP_scopes]-(n) |
| NodeHashJoin | 32933 | 166 | 0 | m, o | o |
| Filter(1) | 1 | 553 | 553 | o | hasLabel(o:LABEL_TYPE_Item) |
| NodeByIndexQuery | 1 | 553 | 554 | o | Literal(content:*itzndby*); INDEX_FULL_TEXT_content_Item |
| Expand(All)(1) | 64914 | 146855 | 146881 | m, o | (m)-[:RELATIONSHIP_observedBy]->(o) |
| NodeUniqueIndexSeek | 27 | 26 | 30 | m | :LABEL_TYPE_PermissionNode(id) |
+---------------------+---------------+--------+--------+-------------+------------------------------------------------------------------------+
Things to think about/try: in general with query optimization, the #1 name of the game is to figure out ways to consider less data in the first place, in the answering of the query. It's far less fruitful to consider the same data faster, than it is to consider less data.
content fields. My understanding is that regex you're doing isn't narrowing cypher's search path any, so it's basically having to look at every o:LABEL_TYPE_Item and run the regex against that field. Your regex is only looking for a substring though, so lucene may help cut down the number of nodes cypher has to consider before it can give you a result.-[:RELATIONSHIP_scopes*1]->) yet the image you give us suggests you only ever need one hop. On both relationship hops, depending on how your graph is structured (and how much data you have) you might be looking through way more information than you need to there. Consider those relationship hops and your data model carefully; can you replace with -[:RELATIONSHIP_scopes]-> instead? Note that you have a WHERE clause on m nodes, you may be traversing more of those than required.PROFILE, google for docs). One trick I see a lot of people using is pushing the most restrictive part of their query to the top, in front of a WITH block. This reduces the number of "starting points". What I mean is taking this query...
MATCH (foo)-[:stuff*]->(bar) // (bunch of other complex stuff)
WHERE bar.id = 5
RETURN foo
And turning it into this:
MATCH bar
WHERE bar.id = 5
WITH bar
MATCH (foo)-[:stuff*]->(bar)
RETURN foo;
(Check output via PROFILE, this trick can be used to force the query execution plan to do the most selective thing first, drastically reducing the amount of the graph that cypher considers/traverses...better performance)
(foo:SpecialIndex { label: "My Nifty Index" }). This is akin to a "view" in a relational database. You link the stuff you want to access quickly to foo. Then your query, instead of having that big WHERE id IN [blah blah] clause, it simply looks up foo:SpecialIndex, traverses to the hit points, then goes from there. This trick works well when the list of entry points in your list of IDs is large, rapidly growing, or both. This keeps all the same computation you'd do normally, but shifts some of it to be done ahead of time so you don't do it every time you run the query.can you share your output of :schema in the browser?
if you don't have it do:
create constraint on (p:LABEL_TYPE_Project) assert p.id is unique;
create constraint on (m:LABEL_TYPE_PermissionNode) assert m.id is unique;
The manual indexes you created only help for Item.content if you index it with FULLTEXT_CONFIG and then use START o=node:items("content:(some string)") MATCH ...
As in Neo4j you can always traverse relationships in both directions, you don't need the inverse relationships, it only hurts performance because queries then tend to check one cycle more.
You don't need variable length paths [*1] in your query, change it to:
MATCH (n:LABEL_TYPE_Project {id:'14'})-[:RELATIONSHIP_scopes]->
(m:LABEL_TYPE_PermissionNode)-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id in ['1', '2', ... '1035242', '1035243']
AND o.content =~ '.*itzndby.*' RETURN o LIMIT 20
For real queries use parameters, for project-id and permission.id ->
MATCH (n:LABEL_TYPE_Project {id: {p_id}})-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id in {m_ids} AND o.content =~ '.*'+{item_content}+'.*'
RETURN o LIMIT 20
remember a realistic query performance only shows up on a warmed up system, so run the query at least twice
you might also want to split up your query
MATCH (n:LABEL_TYPE_Project {id: {p_id}})-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)
WHERE m.id in {m_ids}
WITH distinct m
MATCH (m)-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE o.content =~ '.*'+{item_content}+'.*'
RETURN o LIMIT 20
Also learn about PROFILE you can prefix your query it in the old webadmin: http://localhost:7474/webadmin/#/console/
If you use Neo4j 2.2-M03 there is built in support for query plan visualization with EXPLAIN and PROFILE prefixes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With