
Neo4j 1.9.9 legacy index very slow after deletes

We are using Neo4j 1.9.9, and some of the Cypher queries we were running seemed unreasonably slow. Some investigation showed that:

  • Deleting 200k nodes takes about 2-3 seconds on my hardware (MacBook Pro) when I select them using:

    START n=node(*) DELETE n
    
  • Adding a WHERE clause does not significantly slow it down

  • If the nodes are selected using an index, performance is similar, e.g.

    START n=node:__types__(className="com.e2sd.domain.Comment") DELETE n
    
  • However, when the previous test is repeated, it is 20x or more slower, with the actual time varying from 80 to several hundred seconds. Even more curiously, it doesn't matter whether I repeat the test in the same JVM or start a new program, or whether I first clear out all the nodes in the database and verify that it has zero nodes. The index-based delete is extremely slow on every subsequent run of the test until I clobber my Neo4j data directory with

     rm -R target/neo4j-test/
    

I'll give some example Scala code here. I'm happy to provide more detail as required.

for (j <- 1 to 3) {
  log("Total nodes in database: " + inNeo4j( """ START n=node(*) RETURN COUNT(n) """).to(classOf[Int]).single)
  log("Start")
  inNeo4j(""" CREATE (x) WITH x FOREACH(i IN RANGE(1, 200000, 1) : CREATE ({__type__: "com.e2sd.domain.Comment"})) """)
  rebuildTypesIndex()
  log("Created lots of nodes")
  val x = inNeo4j(
    """
    START n=node:__types__(className="com.e2sd.domain.Comment")
    DELETE n
    RETURN COUNT(n)
    """).to(classOf[Int]).single
  log("Deleted x nodes: " + x)
}

// log is a convenience method that prints a string and the time since the last log
// inNeo4j is a convenience method to run a Cypher query



def rebuildTypesIndex(): Unit = {
  TransactionUtils.withTransaction(neo4jTemplate) {
    log.info("Rebuilding __types__ index...")
    val index = neo4jTemplate.getGraphDatabase.getIndex[Node]("__types__")
    for (node <- GlobalGraphOperations.at(neo4jTemplate.getGraphDatabaseService).getAllNodes.asScala) {
      index.remove(node)
      if (node.hasProperty("__type__")) {
        val typeProperty = node.getProperty("__type__")
        index.add(node, "className", typeProperty)
      }
    }
    log.info("Done")
  }
}

We are using Neo4j embedded here with the following Spring Data configuration.

<bean id="graphDbFactory" class="org.neo4j.graphdb.factory.GraphDatabaseFactory"/>
<bean id="graphDatabaseService" scope="singleton" destroy-method="shutdown"
  factory-bean="graphDbFactory" factory-method="newEmbeddedDatabase">
  <constructor-arg value="target/neo4j-test"/>
</bean>
<neo4j:config graphDatabaseService="graphDatabaseService" base-package="my.package.*"/>

Why is the DELETE query slow under the conditions described?

Jonathan Crosmer asked Nov 01 '22 09:11


1 Answer

You have to explicitly delete entries from the legacy index; deleting the nodes is not enough to remove them from a legacy index. Thus, when you run the test a second time, you have 400k entries in your index, even though half of them point to deleted nodes. Your program gets slower because each repeated run grows the index.

I had this problem when I wrote an extension to Neo4j Spatial to bulk load the RTree. Using the Java API, I had to explicitly delete from the index separately from deleting the node. Glad I could help.
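As a rough illustration of that approach in Scala, here is a minimal sketch against the Neo4j 1.9 embedded Java API. The helper name deleteIndexedNodes and its parameters are hypothetical, and the sketch assumes the matching nodes have no relationships (as in the test from the question) and that all of them fit in a single transaction. It looks up the nodes through the __types__ index, removes each one from the index, and only then deletes it.

```scala
import org.neo4j.graphdb.{GraphDatabaseService, Node}
import scala.collection.JavaConverters._

// Hypothetical helper (not part of Neo4j or Spring Data): removes every node
// indexed under className=<className> from the "__types__" legacy index and
// then deletes the node itself. Returns the number of nodes deleted.
def deleteIndexedNodes(graphDb: GraphDatabaseService, className: String): Int = {
  val index = graphDb.index().forNodes("__types__")
  val tx = graphDb.beginTx()
  try {
    // Materialise the hits first so the index is not modified while it is
    // still being iterated, and always release the underlying resources.
    val hits = index.get("className", className)
    val nodes: List[Node] =
      try hits.iterator().asScala.toList
      finally hits.close()

    for (node <- nodes) {
      index.remove(node) // drop all index entries for this node
      node.delete()      // assumes the node has no relationships
    }

    tx.success()
    nodes.size
  } finally {
    tx.finish() // Neo4j 1.9 uses finish(); close() arrived in 2.0
  }
}
```

It could be called in place of the index-based Cypher DELETE, for example deleteIndexedNodes(graphDatabaseService, "com.e2sd.domain.Comment"). This only prevents new stale entries; entries left behind by earlier runs would presumably still need to be cleared, for instance by dropping and rebuilding the index or by wiping the data directory as described in the question.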

phil_20686 answered Jan 04 '23 13:01