What is the fastest ArangoDB friends-of-friends query (with count)

Tags:

I'm trying to use ArangoDB to get a list of friends-of-friends. Not just a basic friends-of-friends list, I also want to know how many friends the user and the friend-of-a-friend have in common and sort the result. After several attempts at (re)writing the best performing AQL query, this is what I ended up with:

LET friends = (
  FOR f IN GRAPH_NEIGHBORS('graph', @user, {"direction": "any", "includeData": true, "edgeExamples": { name: "FRIENDS_WITH"}})
  RETURN f._id
)

LET foafs = (FOR friend IN friends
  FOR foaf in GRAPH_NEIGHBORS('graph', friend, {"direction": "any", "includeData": true, "edgeExamples": { name: "FRIENDS_WITH"}})
    FILTER foaf._id != @user AND foaf._id NOT IN friends
    COLLECT foaf_result = foaf WITH COUNT INTO common_friend_count
    RETURN {
      user: foaf_result,
      common_friend_count: common_friend_count
    }
)
FOR foaf IN foafs
  SORT foaf.common_friend_count DESC
  RETURN foaf

Unfortunately, performance is not as good as I would've liked. Compared to the Neo4j versions of the same query(and data), AQL seems quite a bit slower (5-10x).

What I'd like to know is... How can I improve our query to make it perform better?

605

asked Oct 22 '15 11:10

Terry Seidler

Video Answer

1 Answers

I am one of the core developers of ArangoDB and tried to optimize your query. As I do not have your dataset I can only talk about my test dataset and would be happy to hear if you can validate my results.

First if all I am running on ArangoDB 2.7 but in this particular case I do not expect a major performance difference to 2.6.

In my dataset I could execute your query as it is in ~7sec. First fix: In your friends statement you use includeData: true and only return the _id. With includeData: false GRAPH_NEIGHBORS directly returns the _id and we can also get rid of the subquery here

LET friends = GRAPH_NEIGHBORS('graph', 
                              @user,
                              {"direction": "any",
                               "edgeExamples": { 
                                   name: "FRIENDS_WITH"
               }})

This got it down to ~ 1.1 sec on my machine. So I expect that this will be close to the performance of Neo4J.

Why does this have a high impact? Internally we first find the _id value without actually loading the documents JSON. In your query you do not need any of this data, so we can safely continue with not opening it.

But now for the real improvement

Your query goes the "logical" way and first gets users neighbors, than finds their neighbors, counts how often a foaf is found and sorts it. This has to build up the complete foaf network in memory and sort it as a whole.

You can also do it in a different way: 1. Find all friends of user (only _ids) 2. Find all foaf (complete document) 3. For each foaf find all foaf_friends (only _ids) 4. Find the intersection of friends and foaf_friends and COUNT them

This query would like this:

LET fids = GRAPH_NEIGHBORS("graph",
                           @user,
                           {
                             "direction":"any",
                             "edgeExamples": {
                               "name": "FRIENDS_WITH"
                              }
                           }
                          )
FOR foaf IN GRAPH_NEIGHBORS("graph",
                            @user,
                            {
                              "minDepth": 2,
                              "maxDepth": 2,
                              "direction": "any",
                              "includeData": true,
                              "edgeExamples": {
                                "name": "FRIENDS_WITH"
                              }
                            }
                           )
  LET commonIds = GRAPH_NEIGHBORS("graph",
                                  foaf._id, {
                                    "direction": "any",
                                    "edgeExamples": {
                                      "name": "FRIENDS_WITH"
                                     }
                                  }
                                 )
  LET common_friend_count = LENGTH(INTERSECTION(fids, commonIds))
  SORT common_friend_count DESC
  RETURN {user: foaf, common_friend_count: common_friend_count}

Which in my test graph was executed in ~ 0.024 sec

So this gave me a factor 250 faster execution time and I would expect this to be faster than your current query in Neo4j, but as I do not have your dataset I can not verify it, it would be good if you could do it and tell me.

One last thing

With the edgeExamples: {name : "FRIENDS_WITH" } it is the same as with includeData, in this case we have to find the real edge and look into it. This could be avoided if you store your edges in separate collections based on their name. And then remove the edgeExamples as well. This will further increase the performance (especially if there are a lot of edges).

Future

Stay tuned for our next release, we are right now adding some more functionality to AQL which will make your case much easier to query and should give another performance boost.

121

answered Sep 28 '22 09:09

mchacki

Related questions
                            
                                Google Prediction API vs Graph Databases for Generated Recommendations?
                            
                                Graph Database Design Methodologies
                            
                                Gremlin : Multiple filter condition "OR"
                            
                                Why would index nodes or an indexed property be better in a graph database?
                            
                                Neo4j vs Apache Giraph in graph traversal
                            
                                Performance of arbitrary queries with Neo4j
                            
                                What can an RDBMS do that Neo4j (and graph databases) cant?
                            
                                Storing very large graphs on disk/streaming graph partitioning algorithms?
                            
                                Rails 3 and graph databases
                            
                                Is it possible to visualize the output of a graph query (Gremlin or SPARQL) as nodes and edges in Amazon Neptune? [closed]
                            
                                Hold entire Neo4j graph database in RAM?
                            
                                Simple way to delete a relationship by ID in Neo4j Cypher?
                            
                                How to optimize graph traversals in ArangoDB?
                            
                                Graph Database: TinkerPop/Blueprints vs W3C Linked data
                            
                                What is the difference between Orientdb editions?
                            
                                How to store graph data in a database?
                            
                                Storing multiple graphs in Neo4J
                            
                                Loading a Neo4j subgraph into Networkx
                            
                                Graph databases and RDF triplestores: storage of graph data in python
                            
                                How to Model Real-World Relationships in a Graph Database (like Neo4j)?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

What is the fastest ArangoDB friends-of-friends query (with count)

Tags:

graph-databases

arangodb

aql

Terry Seidler

People also ask

Video Answer

1 Answers

mchacki

Recent Activity

Donate For Us