I have an RDF-like graph data structure, i.e. consisting of nodes (entities) which are connected by edges (properties, relations) of different kinds. The user will select a node in that graph (millions of nodes, hundreds of millions of edges), and I am looking for a fast way to display the "proximity" of the selected node (i.e. one or two levels of nodes from which there is a path via a set of possibly specified relations to the originally selected node).
I have done some research and come across RDF-specialized triple stores and more general graph databases like Neo4j and AllegroGraph. Then there are also middleware products like Jena and Sesame.
Would you recommend a triple store or a graph database for making querying for nearby connected nodes efficient? Do middlewares play a role here? I understand that in each case, holding the complete graph in memory is probably going to be advantageous.
Alexander
I would recommend one of the RDF stores (Jena, Sesame, 4store, Virtuoso, OWLIM, Oracle etc.). Then you can write the SPARQL query for your problem once and try it on a variety of systems without having to code against different APIs.
There are a couple of approaches you could take. The easiest is probably a UNION query over the different paths. You can use a variable for the edge URI and add a FILTER to limit it to just the edges you're interested in.
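As a rough illustration of the UNION-plus-FILTER idea, here is a small Python helper that only builds the query text. The function name `build_proximity_query` and the example URIs are made up for this sketch, it assumes edges are followed from the selected node outwards (subject to object), and the `IN` operator it uses requires a SPARQL 1.1 store.

```python
def build_proximity_query(node_uri, allowed_predicates):
    """Return SPARQL text selecting nodes one or two hops away from node_uri.

    Hypothetical helper: it only assembles a string and does not talk to a store.
    """
    # Render the allowed edge URIs as a comma-separated list for the IN operator.
    allowed = ", ".join(f"<{p}>" for p in allowed_predicates)
    return f"""
SELECT DISTINCT ?p1 ?n1 ?p2 ?n2 WHERE {{
  {{
    # One hop: selected node -> ?n1
    <{node_uri}> ?p1 ?n1 .
    FILTER (?p1 IN ({allowed}))
  }}
  UNION
  {{
    # Two hops: selected node -> ?n1 -> ?n2
    <{node_uri}> ?p1 ?n1 .
    ?n1 ?p2 ?n2 .
    FILTER (?p1 IN ({allowed}) && ?p2 IN ({allowed}))
  }}
}}
""".strip()


if __name__ == "__main__":
    # Example URIs are placeholders, not taken from any real dataset.
    print(build_proximity_query(
        "http://example.org/node/42",
        ["http://example.org/knows", "http://example.org/partOf"],
    ))
```

The resulting string can be sent to whichever store you are evaluating, which is what makes it easy to compare several of them with the same query.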
To clarify, I would not classify Jena and/or Sesame as middleware. They both have native storage and indexes.
Jena has TDB, which uses B+Tree indexes. For the default graph in particular, you have three indexes: SPO, POS and OSP.
In your case, the SPO index will be used to give you all the triples for a given subject. If you want to go two levels deep, you'll need to touch the index multiple times: once for the initial subject and once for each of the objects connected to your subject.
TDB uses memory-mapped files to cache your indexes, so if you have enough RAM it shouldn't be a problem.
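To make the repeated lookups concrete, here is a toy in-memory sketch in Python. It is not how TDB is implemented: the nested dictionary merely stands in for an SPO index, and the names `build_spo_index` and `neighborhood` are invented for this example.

```python
from collections import defaultdict


def build_spo_index(triples):
    """Toy stand-in for an SPO index: subject -> predicate -> set of objects."""
    spo = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        spo[s][p].add(o)
    return spo


def neighborhood(spo, start, depth=2, predicates=None):
    """Collect triples reachable from start within `depth` hops.

    One index lookup is made per node in the current frontier, i.e. once for
    the start node and then once for each object found at the previous level.
    """
    seen = {start}
    frontier = [start]
    result = []
    for _ in range(depth):
        next_frontier = []
        for s in frontier:
            # .get() avoids creating empty entries for nodes with no outgoing edges.
            for p, objects in spo.get(s, {}).items():
                if predicates is not None and p not in predicates:
                    continue
                for o in sorted(objects):
                    result.append((s, p, o))
                    if o not in seen:
                        seen.add(o)
                        next_frontier.append(o)
        frontier = next_frontier
    return result


if __name__ == "__main__":
    triples = [("a", "knows", "b"), ("b", "knows", "c"),
               ("c", "knows", "d"), ("a", "likes", "e")]
    for triple in neighborhood(build_spo_index(triples), "a", depth=2):
        print(triple)
```

The number of lookups grows with the fan-out of the first level, which is why keeping the index cached in RAM matters for this kind of query.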
What you want to do is very close to what people in the RDF community call a Concise Bounded Description (CBD). However, if you want to go two or more levels deep, you'll need to implement that yourself. The SPARQL query language gives you a DESCRIBE form you can use (but it's one level deep).
Last but not least, you say you have an RDF-like graph data structure, but it's not RDF. For this reason, you should either convert your data to RDF or give up on the idea of using a triple store, since triple stores are designed to load and manage RDF data. That said, you could use just part of the storage and indexing layer to build and use your own custom indexes.
The best thing for you is to run an experiment with your data and compare how the different solutions work with your use case.