Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Graph databases and RDF triplestores: storage of graph data in python

I need to develop a graph database in python (I would enjoy if anybody can join me in the development. I already have a bit of code, but I would gladly discuss about it).

I did my research on the internet. in Java, neo4j is a candidate, but I was not able to find anything about actual disk storage. In python, there are many graph data models (see this pre-PEP proposal, but none of them satisfy my need to store and retrieve from disk.

I do know about triplestores, however. triplestores are basically RDF databases, so a graph data model could be mapped in RDF and stored, but I am generally uneasy (mainly due to lack of experience) about this solution. One example is Sesame. Fact is that, in any case, you have to convert from in-memory graph representation to RDF representation and viceversa in any case, unless the client code wants to hack on the RDF document directly, which is mostly unlikely. It would be like handling DB tuples directly, instead of creating an object.

What is the state-of-the-art for storage and retrieval (a la DBMS) of graph data in python, at the moment? Would it make sense to start developing an implementation, hopefully with the help of someone interested in it, and in collaboration with the proposers for the Graph API PEP ? Please note that this is going to be part of my job for the next months, so my contribution to this eventual project is pretty damn serious ;)

Edit: Found also directededge, but it appears to be a commercial product

like image 645
Stefano Borini Avatar asked Aug 19 '09 23:08

Stefano Borini


People also ask

How does RDF store its data?

RDFStore implements a generic hashed data storage that allows to serialise RDF models, resources, properties and property values either to disk or in-memory data structures. It does support several different persistent storage models such as SDBM, BerkeleyDB (standard and Sleepycat) and DBMS.

Is RDF a graph database?

The RDF triplestore is a type of graph database that stores semantic facts. RDF, which stands for Resource Description Framework, is a model for data publishing and interchange on the Web standardized by W3C. Being a graph database, triplestores store data as a network of objects with materialized links between them.

How are graph databases stored?

Graph data is kept in store files, each of which contain data for a specific part of the graph, such as nodes, relationships, labels and properties. Dividing the storage in this way facilitates highly performant graph traversals (as detailed above).

What is graph database in python?

Graph Databases is a NoSQL database based on Graph Theory and it consists of objects called nodes, properties, and edges (relationships) to represent, store, and search the relationships of data. Graph databases treat the relationships between nodes equally important as data itself.


1 Answers

I have used both Jena, which is a Java framework, and Allegrograph (Lisp, Java, Python bindings). Jena has sister projects for storing graph data and has been around a long, long time. Allegrograph is quite good and has a free edition, I think I would suggest this cause it is easy to install, free, fast and you could be up and going in no time. The power you would get from learning a little RDF and SPARQL may very well be worth your while. If you know SQL already then you are off to a great start. Being able to query your graph using SPARQL would yield some great benefits to you. Serializing to RDF triples would be easy, and some of the file formats are super easy ( NT for instance ). I'll give an example. Lets say you have the following graph node-edge-node ids:

1 <- 2 -> 3
3 <- 4 -> 5

these are already subject predicate object form so just slap some URI notation on it, load it in the triple store and query at-will via SPARQL. Here it is in NT format:

<http://mycompany.com#1> <http://mycompany.com#2> <http://mycompany.com#3> .
<http://mycompany.com#3> <http://mycompany.com#4> <http://mycompany.com#5> .

Now query for all nodes two hops from node 1:

SELECT ?node
WHERE {
    <http://mycompany.com#1> ?p1 ?o1 .
    ?o1 ?p2 ?node .
}

This would of course yield <http://mycompany.com#5>.

Another candidate would be Mulgara, written in pure Java. Since you seem more interested in Python though I think you should take a look at Allegrograph first.

like image 183
harschware Avatar answered Sep 17 '22 17:09

harschware