data model for tree structure (file system): document model vs graph model

Tags: mongodb, graph-databases, orientdb, neo4j, arangodb

I'm evaluating a NoSQL solution to implement a file-system-like structure with millions of items, where the key requirement is:

Fast lookup of the "parents", "direct children", or "subtree children" of an item, filtered by n item properties, with paged results sorted by an item property.

Given this requirement, I split the problem into two tasks:

1. Model the recursive item structure for searching children and subtree children.
2. Model the item structure for searching over item properties.

The schema-free nature of NoSQL is a good fit for storing different properties for each file, which covers point 2.

For point 1, however, I'm unsure about the pros and cons of using a document database (for example MongoDB) with a single collection of items and a materialized path design pattern, versus a graph database (for example ArangoDB) with two collections: items for data (a document collection) and itemsParents for the parent-child relation (an edge collection), queried with a graph traversal function.

Are there performance advantages to using a graph database for my requirements?

Is graph traversal more efficient than a materialized path filter for this task?

If so, can you explain why?

Thanks


asked Oct 18 '15 22:10

Claudio

1 Answer

A graph database would certainly be a great choice for a hierarchical structure like a filesystem. Speaking specifically of Neo4j you could have a schema such as:

(:Folder)-[:IN_FOLDER]->(:Folder)
(:File)-[:IN_FOLDER]->(:Folder)

Finding a file or a folder is as simple as the following Cypher:

MATCH (file:File {path: '/dir/file'})
RETURN file

To find all of the files/folders directly under a folder:

MATCH (folder:Folder {path: '/dir'})<-[:IN_FOLDER]-(file_or_folder)
RETURN file_or_folder

If you wanted to find all files/folders recursively you could do:

MATCH (folder:Folder {path: '/dir'})<-[:IN_FOLDER*1..5]-(file_or_folder)
RETURN file_or_folder

The 1..5 sets the depth range of the search (from one to five levels below the folder).
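To make the depth range concrete, here is a minimal Python sketch of what a bounded traversal does conceptually. It is not how Neo4j implements it: the `IN_FOLDER` dictionary, `build_children`, and `descendants` are hypothetical names standing in for the graph and the variable-length match.

```python
from collections import deque

# Hypothetical in-memory stand-in for the graph: each entry is one
# child -> parent IN_FOLDER relationship.
IN_FOLDER = {
    "/dir/a": "/dir",
    "/dir/file": "/dir",
    "/dir/a/b": "/dir/a",
    "/dir/a/b/deep.txt": "/dir/a/b",
}


def build_children(in_folder):
    """Invert child -> parent edges into a parent -> [children] adjacency map."""
    children = {}
    for child, parent in in_folder.items():
        children.setdefault(parent, []).append(child)
    return children


def descendants(children, start, min_depth=1, max_depth=5):
    """Breadth-first walk returning nodes min_depth..max_depth hops below start."""
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand past the upper bound
        for child in children.get(node, []):
            if depth + 1 >= min_depth:
                found.append(child)
            queue.append((child, depth + 1))
    return found


children = build_children(IN_FOLDER)
print(descendants(children, "/dir", 1, 1))  # direct children only
print(descendants(children, "/dir", 1, 5))  # whole subtree, up to five levels
```

Each hop only looks at the children of the node being expanded, so the work is proportional to the size of the subtree visited, not to the total number of items.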

For all of these you'd want an index on the path property for both the Folder and File labels. Of course you wouldn't need to do it this way, depending on your use-case.

Neo4j can be much faster in this case because, once a node has been found on disk, its relationships can be traversed with just a few file accesses, rather than searching an entire table or index for each hop. For details on Neo4j's internals, I recommend the free book Graph Databases, published by O'Reilly.
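For comparison, the materialized-path option from the question can be sketched roughly as below. This is only an in-memory illustration under the assumption that every item stores its full path in an indexed field. The sorted `PATHS` list plays the role of that index, and `subtree` and `direct_children` are hypothetical helpers, not any database's API.

```python
from bisect import bisect_left

# Hypothetical items. In a document store each would be one document
# carrying its full path; the sorted list stands in for an index on it.
PATHS = sorted([
    "/dir",
    "/dir/a",
    "/dir/a/b",
    "/dir/a/b/deep.txt",
    "/dir/file",
    "/dir2/other",
])


def subtree(paths, folder):
    """All descendants of folder via a single range scan over sorted paths."""
    prefix = folder.rstrip("/") + "/"
    start = bisect_left(paths, prefix)  # first path that could match
    result = []
    for path in paths[start:]:
        if not path.startswith(prefix):
            break  # sorted order: the first non-match ends the range
        result.append(path)
    return result


def direct_children(paths, folder):
    """Descendants whose remaining path has no further separator."""
    prefix = folder.rstrip("/") + "/"
    return [p for p in subtree(paths, folder) if "/" not in p[len(prefix):]]


print(subtree(PATHS, "/dir"))
print(direct_children(PATHS, "/dir"))
```

With this design a whole subtree is one prefix range scan however deep it goes, while direct children need an extra depth filter (or a separately stored parent field). Moving a folder means rewriting the path of every descendant, whereas in the graph model only one relationship changes.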


answered Sep 30 '22 14:09

Brian Underwood
