
Storing a deep directory tree in a database

I am working on a desktop application that is much like WinDirStat or voidtools' Everything - it maps hard drives, i.e. creates a deeply nested dictionary out of the directory tree.

The desktop application should then store the directory trees in some kind of database, so that a web application can be used to browse them from root, depth level by depth level.

Assume both applications run locally on the same machine for the time being.

The question that comes to mind is how the data should be structured and which database should be used, considering: 1) RAM consumption should be reasonable, and 2) the time it takes for the directory to be ready for viewing in the web application should be minimal.

P.S. - My initial approach was serializing each file system node to JSON separately and inserting each into Mongo, with object references linking them to their children. That way the web application could easily load the data based on user demand. However, I am worried that making so many (a million, on average) independent inserts to Mongo will take a lot of time; if I make bulk inserts instead, that means I have to keep each bulk in memory.

I also considered dumping the entire tree as one deeply nested JSON document, but the data is too large for a single Mongo document. GridFS could be used to store it, but then I would have to load the entire tree in the web application even though the deep nodes may not be of interest.
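For reference, here is a rough sketch of what the batched-insert variant of the first approach could look like with the Node.js MongoDB driver (the collection and function names are made up; the point is that only one batch is held in memory at a time, not the whole tree):

import { Collection, Document } from "mongodb";

// Insert filesystem nodes in fixed-size batches so that only `batchSize`
// documents are held in memory at any moment, not the entire tree.
async function insertNodesInBatches(
    nodes: Collection<Document>,
    walk: Iterable<Document>, // lazily yields one serialized node at a time
    batchSize = 1000
): Promise<void> {
    let batch: Document[] = [];
    for (const node of walk) {
        batch.push(node);
        if (batch.length >= batchSize) {
            await nodes.insertMany(batch, { ordered: false });
            batch = [];
        }
    }
    if (batch.length > 0) {
        await nodes.insertMany(batch, { ordered: false });
    }
}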

asked Oct 15 '18 by Kfir Eichenblat

1 Answer

Given your requirements of:

  • A) Low RAM usage
  • B) Staying within Mongo's document size limits
  • C) A responsive UI

I'd consider something along the lines of the following.

Take this example directory tree:

C:\
C:\X\
C:\X\B\
C:\X\file.txt
C:\Y\
C:\Y\file.pdf
C:\Y\R\
C:\Y\R\file.js

In JSON it could possibly be represented as:

{
    "C:": {
        "X": {
            "B": {},
            "file.txt": "file information..."
        },
        "Y": {
            "file.pdf": "file information...",
            "R": {
                "file.js": "file information..."
            }
        }
    }
}

The latter, as you pointed out, does not scale well with large directory structures (I can tell you first-hand that browsers will not appreciate a JSON blob representing even a modest directory with a few thousand files/folders). The former, though akin to how some actual filesystems work and efficient in the right context, is a pain to convert to and from JSON.

My proposal is to break each directory into a separate JSON document, as this addresses all three issues. However, nothing is free: it will increase code complexity, the number of requests per session, and so on.

The above structure could be broken into the following documents:

[
    {
        "id": "00000000-0000-0000-0000-000000000000",
        "type": "d",
        "name": "C:",
        "children": [
            "11111111-1111-1111-1111-111111111111",
            "22222222-2222-2222-2222-222222222222"
        ]
    },
    {
        "id": "11111111-1111-1111-1111-111111111111",
        "type": "d",
        "name": "X",
        "children": [
            "33333333-3333-3333-3333-333333333333",
            "55555555-5555-5555-5555-555555555555"
        ]
    },
    {
        "id": "22222222-2222-2222-2222-222222222222",
        "type": "d",
        "name": "Y",
        "children": [
            "44444444-4444-4444-4444-444444444444",
            "66666666-6666-6666-6666-666666666666"
        ]
    },
    {
        "id": "33333333-3333-3333-3333-333333333333",
        "type": "d",
        "name": "B",
        "children": []
    },
    {
        "id": "44444444-4444-4444-4444-444444444444",
        "type": "d",
        "name": "R",
        "children": [
            "77777777-7777-7777-7777-777777777777"
        ]
    },
    {
        "id": "55555555-5555-5555-5555-555555555555",
        "type": "f",
        "name": "file.txt",
        "size": "1024"
    },
    {
        "id": "66666666-6666-6666-6666-666666666666",
        "type": "f",
        "name": "file.pdf",
        "size": "2048"
    },
    {
        "id": "77777777-7777-7777-7777-777777777777",
        "type": "f",
        "name": "file.js",
        "size": "2048"
    }
]

Where each document represents a directory or file and (if a directory) the IDs of its immediate children. Child items can be lazy-loaded using their IDs and appended to their parent in the UI. Well-implemented lazy loading can preload child nodes to a desired depth, creating a very responsive UI. RAM usage is minimal, as your server only has to handle small payloads per request. The number of requests does go up considerably versus a single-document approach, but again, some clever lazy loading can cluster requests and reduce the total number.
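As a rough sketch (not a full implementation), loading one directory level on demand with the Node.js MongoDB driver could look something like this, assuming the documents above are stored in a nodes collection with an index on id:

import { Collection, Document } from "mongodb";

// Fetch a single directory node plus its immediate children - one level at a time.
// The `id` values are the application-level UUIDs from the documents above.
async function loadDirectoryLevel(nodes: Collection<Document>, nodeId: string) {
    // Fetch the directory itself...
    const dir = await nodes.findOne({ id: nodeId });
    if (!dir || dir.type !== "d") return dir;

    // ...then fetch all of its direct children in a single query.
    const children = await nodes.find({ id: { $in: dir.children } }).toArray();

    return { ...dir, children };
}

The UI would call this once per expanded folder (or a few levels ahead when pre-loading), so each response stays small.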

UPDATE 1: I somehow overlooked your second-to-last paragraph before answering, so this is probably more or less what you had in mind. To address the issue of too many documents, some level of clustering nodes within documents may be in order. I have to head off now, but I'll give it some thought.

UPDATE 2: I've created a gist of a simplified version of the clustering concept I mentioned. It doesn't take files into account, just folders, and it doesn't include any code to update the documents. Hopefully it'll give you some ideas; I'll continue to update it for my own purposes.

Gist: tree_docs_cluster.js
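To give a rough idea of the clustering direction (this is only an illustration, not the structure used in the gist; the field names are made up), each document could hold a small subtree up to a fixed depth or node count, with the deepest entries pointing at the IDs of the cluster documents beneath them:

{
    "id": "cluster-0",
    "root": "C:",
    "tree": {
        "X": {
            "B": {},
            "file.txt": "file information..."
        },
        "Y": { "cluster": "cluster-1" }
    }
}

This keeps the number of documents (and therefore requests) down while still keeping any single document well under Mongo's size limit.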

answered Sep 28 '22 by Richard Dunn