I'm evaluating a NoSQL solution to implement a file-system-like structure with millions of items, where the key features have to be:
Given these requirements, I split the problem into 2 tasks:
The schema-free nature of NoSQL is a good fit for storing different properties for each file, which covers point 2.
I do have some doubts about point 1, though. What are the pros and cons of using a document database (for example MongoDB) with a single collection of items and the materialized path design pattern, versus a graph database (for example ArangoDB) with 2 collections, items for the data (a document collection) and itemsParents for the parent-child relation (an edge collection), plus a graph traversal function?
Are there performance advantages to using a graph database for my requirements?
Is a graph traversal more efficient than a materialized path filter for this task?
If yes, can you explain why?
Thanks
Graph data modeling is the process in which a user describes an arbitrary domain as a connected graph of nodes and relationships with properties and labels.
Neo4j is an OLTP graph database which excels at querying data relationships, which is a weakness of other NoSQL and SQL solutions. We created the Neo4j Doc Manager for Mongo Connector to allow MongoDB developers to store JSON data in Mongo while querying the relationships between the data using Neo4j.
Graph databases have advantages for use cases such as social networking, recommendation engines, and fraud detection, when you need to create relationships between data and quickly query these relationships. The following graph shows an example of a social network graph.
A graph database would certainly be a great choice for a hierarchical structure like a filesystem. Speaking specifically of Neo4j, you could have a schema such as:
(:Folder)-[:IN_FOLDER]->(:Folder)
(:File)-[:IN_FOLDER]->(:Folder)
Finding a file or a folder is as simple as the following Cypher:
MATCH (file:File {path: '/dir/file'})
RETURN file
To find all of the files/folders directly under a folder:
MATCH (folder:Folder {path: '/dir'})<-[:IN_FOLDER]-(file_or_folder)
RETURN file_or_folder
If you wanted to find all files/folders recursively you could do:
MATCH (folder:Folder {path: '/dir'})<-[:IN_FOLDER*1..5]-(file_or_folder)
RETURN file_or_folder
The 1..5 adjusts the depth (from one to five levels) that you are searching.
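To make the meaning of the depth bound concrete, here is a minimal in-memory sketch in Python. It is not Neo4j code: the `children` mapping and the `descendants` helper are hypothetical names made up for illustration, and the sketch assumes the hierarchy is a tree (no cycles), as a filesystem normally is.

```python
from collections import deque

def descendants(children, root, min_depth=1, max_depth=5):
    """Collect nodes between min_depth and max_depth hops below root.

    Mirrors the idea of a bounded variable-length pattern such as *1..5:
    the walk stops expanding once max_depth hops have been taken.
    """
    found = []
    queue = deque([(root, 0)])  # (node, hops taken from root)
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not look below the depth limit
        for child in children.get(node, []):
            if depth + 1 >= min_depth:
                found.append(child)
            queue.append((child, depth + 1))
    return found

# Hypothetical sample hierarchy: folder path -> direct children
children = {
    "/dir": ["/dir/a", "/dir/file"],
    "/dir/a": ["/dir/a/b"],
    "/dir/a/b": ["/dir/a/b/deep.txt"],
}

two_levels = descendants(children, "/dir", 1, 2)
all_levels = descendants(children, "/dir")
```

With a bound of 1..2 the file three levels down is left out; with the default 1..5 it is included.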
For all of these you'd want an index on the path property for both the Folder and File labels. Of course you wouldn't need to do it this way, depending on your use-case.
Neo4j can be so much faster in this case because once you find a node on disk, its relationships can be traversed with just a few file accesses, as opposed to searching an entire table or index for each hop. I recommend checking out the free book Graph Databases, published by O'Reilly, for details on the internals of Neo4j.
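The difference can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not a benchmark and not how either database is implemented: `subtree_by_path_scan` stands in for a materialized path filter that has no usable index, `subtree_by_traversal` stands in for following parent-child links, and both function names and the sample data are hypothetical. A real document database with an index on the path field can often answer a prefix query without scanning every item, so the gap in practice depends on indexing.

```python
def subtree_by_path_scan(items, folder):
    """Materialized path, no index: test the prefix of every stored path."""
    prefix = folder.rstrip("/") + "/"
    examined = 0
    matches = []
    for path in items:
        examined += 1  # every item in the collection is inspected
        if path.startswith(prefix):
            matches.append(path)
    return matches, examined

def subtree_by_traversal(children, folder):
    """Link traversal: follow child links, touching only the subtree."""
    examined = 0
    matches = []
    stack = [folder]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            examined += 1  # only nodes below `folder` are inspected
            matches.append(child)
            stack.append(child)
    return matches, examined

# Hypothetical sample data: the same hierarchy in both representations
items = ["/dir", "/dir/a", "/dir/a/x.txt", "/other", "/other/y.txt", "/other/z.txt"]
children = {
    "/": ["/dir", "/other"],
    "/dir": ["/dir/a"],
    "/dir/a": ["/dir/a/x.txt"],
    "/other": ["/other/y.txt", "/other/z.txt"],
}

scan_matches, scan_examined = subtree_by_path_scan(items, "/dir")
walk_matches, walk_examined = subtree_by_traversal(children, "/dir")
```

Both calls return the same two descendants of /dir, but the scan inspects all six items while the traversal inspects only the two nodes in the subtree. In this model the traversal cost grows with the size of the subtree rather than with the size of the whole collection.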