Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting data for d3 from ArangoDB using AQL (or arangojs)

I'm building an app based around a d3 force-directed graph with ArangoDB on the backend, and I want to be able to load node and link data dynamically from Arango as efficiently as possible.

I'm not an expert in d3, but in general the force layout seems to want the its data as an array of nodes and an array of links that have the actual node objects as their sources and targets, like so:

var nodes = [
        {id: 0, reflexive: false},
        {id: 1, reflexive: true },
        {id: 2, reflexive: false}
    ],
    links = [
        {source: nodes[0], target: nodes[1], left: false, right: true },
        {source: nodes[1], target: nodes[2], left: false, right: true }
    ];

Currently I am using the following AQL query to get neighboring nodes, but it is quite cumbersome. Part of the difficulty is that I want to include edge information for nodes even when those edges are not traversed (in order to display the number of links a node has before loading those links from the database).

LET docId = "ExampleDocClass/1234567"

 // get data for all the edges
LET es = GRAPH_EDGES('EdgeClass',docId,{direction:'any',maxDepth:1,includeData:true})

// create an array of all the neighbor nodes
LET vArray = ( 
    FOR v IN GRAPH_TRAVERSAL('EdgeClass',docId[0],'any',{ maxDepth:1})
        FOR v1 IN v RETURN v1.vertex
    )

// using node array, return inbound and outbound for each node 
LET vs = (
    FOR v IN vArray
        // inbound and outbound are separate queries because I couldn't figure out
        // how to get Arango to differentiate inbout and outbound in the query results
        LET oe = (FOR oe1 IN GRAPH_EDGES('EdgeClass',v,{direction:'outbound',maxDepth:1,includeData:true}) RETURN oe1._to)
        LET ie = (FOR ie1 IN GRAPH_EDGES('EdgeClass',v,{direction:'inbound',maxDepth:1,includeData:true}) RETURN ie1._from)
        RETURN {'vertexData': v, 'outEdges': oe, 'inEdges': ie}
    )
RETURN {'edges':es,'vertices':vs}

The end output looks like this: http://pastebin.com/raw.php?i=B7uzaWxs ...which can be read almost directly into d3 (I just have to deduplicate a bit).

My graph nodes have a large amount of links, so performance is important (both in terms of load on the server and client, and file size for communication between the two). I am also planning on creating a variety of commands to interact with the graph aside from simply expanding neighboring nodes. Is there a way to better structure this AQL query (e.g. by avoiding four separate graph queries) or avoid AQL altogether using arangojs functions or a FOXX app, while still structuring the response in the format I need for d3 (including link data with each node)?

like image 368
ropeladder Avatar asked Nov 22 '15 14:11

ropeladder


1 Answers

sorry for the late reply, we were busy building v2.8 ;) I would suggest to do as many things as possible on the database side, as copying and serializing/deserializing JSON over the network is typically expensive, so transferring as little data as possible should be a good aim.

First of all i have used your query and executed it on a sample dataset i created (~ 800 vertices and 800 edges are hit in my dataset) As a baseline i used the execution time of your query which in my case was ~5.0s

So i tried to create the exact same result as you need in AQL only. I have found some improvements in your query: 1. GRAPH_NEIGHBORS is a bit faster than GRAPH_EDGES. 2. If possible avoid {includeData: true} if you do not need the data Especially if you need to/from vertices._id only GRAPH_NEIGHBORS with {includeData: false} outperforms GRAPH_EDGES by an order of magnitude. 3. GRAPH_NEIGHBORS is deduplicated, GRAPH_EDGES is not. Which in your case seems to be desired. 3. You can get rid of a couple of subqueries there.

So here is the pure AQL query i could come up with:

LET docId = "ExampleDocClass/1234567"
LET edges = GRAPH_EDGES('EdgeClass',docId,{direction:'any',maxDepth:1,includeData:true})
LET verticesTmp = (FOR v IN GRAPH_NEIGHBORS('EdgeClass', docId, {direction: 'any', maxDepth: 1, includeData: true})
  RETURN {
    vertexData: v,
    outEdges: GRAPH_NEIGHBORS('EdgeClass', v, {direction: 'outbound', maxDepth: 1, includeData: false}),
    inEdges: GRAPH_NEIGHBORS('EdgeClass', v, {direction: 'inbound', maxDepth: 1, includeData: false})
  })
LET vertices = PUSH(verticesTmp, {
  vertexData: DOCUMENT(docId),
  outEdges: GRAPH_NEIGHBORS('EdgeClass', docId, {direction: 'outbound', maxDepth: 1, includeData: false}),
  inEdges: GRAPH_NEIGHBORS('EdgeClass', docId, {direction: 'inbound', maxDepth: 1, includeData: false})
})
RETURN { edges, vertices }

This yields the same result format as your query and has the advantage that every vertex connected to docId is stored exactly once in vertices. Also docId itself is stored exactly once in vertices. No deduplication required on client side. But, in outEdges / inEdges of each vertices all connected vertices are also exactly once, i do not know if you need to know if there are multiple edges between vertices in this list as well.

This query uses ~0.06s on my dataset.

However if you put some more effort into it you could also consider to use a hand-crafted traversal inside a Foxx application. This is a bit more complicated but might be faster in your case, as you do less subqueries. The code for this could look like the following:

var traversal = require("org/arangodb/graph/traversal");
var result = {
  edges: [],
  vertices: {}
}
var myVisitor = function (config, result, vertex, path, connected) {
  switch (path.edges.length) {
    case 0:
      if (! result.vertices.hasOwnProperty(vertex._id)) {
        // If we visit a vertex, we store it's data and prepare out/in
        result.vertices[vertex._id] = {
          vertexData: vertex,
          outEdges: [],
          inEdges: []
        };
      }

      // No further action
      break;
    case 1:
      if (! result.vertices.hasOwnProperty(vertex._id)) {
        // If we visit a vertex, we store it's data and prepare out/in
        result.vertices[vertex._id] = {
          vertexData: vertex,
          outEdges: [],
          inEdges: []
        };
      }
      // First Depth, we need EdgeData
      var e = path.edges[0];
      result.edges.push(e);
      // We fill from / to for both vertices
      result.vertices[e._from].outEdges.push(e._to);
      result.vertices[e._to].inEdges.push(e._from);
      break;
    case 2:
      // Second Depth, we do not need EdgeData
      var e = path.edges[1];
      // We fill from / to for all vertices that exist
      if (result.vertices.hasOwnProperty(e._from)) {
        result.vertices[e._from].outEdges.push(e._to);
      }
      if (result.vertices.hasOwnProperty(e._to)) {
        result.vertices[e._to].inEdges.push(e._from);
      }
      break;
  }
};
var config = {
  datasource: traversal.generalGraphDatasourceFactory("EdgeClass"),
  strategy: "depthfirst",
  order: "preorder",
  visitor: myVisitor,
  expander: traversal.anyExpander,
  minDepth: 0,
  maxDepth: 2
};
var traverser = new traversal.Traverser(config);
traverser.traverse(result, {_id: "ExampleDocClass/1234567"});
return {
  edges: result.edges,
  vertices: Object.keys(result.vertices).map(function (key) {
              return result.vertices[key];
            })
};

The idea of this traversal is to visit all vertices from the start vertex to up to two edges away. All vertices in 0 - 1 depth will be added with data into the vertices object. All edges originating from the start vertex will be added with data into the edges list. All vertices in depth 2 will only set the outEdges / inEdges in the result.

This has the advantage that, vertices is deduplicated. and outEdges/inEdges contain all connected vertices multiple times, if there are multiple edges between them.

This traversal executes on my dataset in ~0.025s so it is twice as fast as the AQL only solution.

hope this still helps ;)

like image 102
mchacki Avatar answered Oct 03 '22 03:10

mchacki