How to efficiently store and read back a Hierarchy from cache

I'm currently storing a hierarchy in a SQL database that's quickly approaching 15,000 nodes (5,000 edges). This hierarchy defines my security model based on a user's position in the tree, granting access to the items below. When a user requests a list of all secured items, I use a recursive CTE in the database to walk the tree and flatten all items, and that query is starting to show its age (it's slow).

The hierarchy doesn't change often, so I've attempted to move it into RAM (Redis). Keep in mind that I have many subsystems that need it for security calls, and UIs that build the tree for CRUD operations.

First Attempt

My first attempt was to store the relationships as key-value pairs (this is how they're stored in the database):

       E
     /   \
    F     G
   / \   /  \
  H  I  J    K

mapped to:
    E - [F, G]
    F - [H, I]
    G - [J, K]

So when I want E and all its descendants, I recursively get its children and their children using the keys, which lets me start at any node and move down. This solution gave a good speed increase, but with 15,000 nodes it took approximately 5,000 cache hits to rebuild my tree in code. That is the worst-case scenario, starting at E: performance depends on the starting node's location, so super users see the worst performance. It was still pretty fast but seemed too chatty. I like the fact that I can remove a node at any time by popping it out of its key's list without rebuilding my entire cache. It was also lightning fast for building a tree visually, on demand, in a UI.
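To make the lookup pattern concrete, here is a minimal sketch of this first approach. It assumes a plain Python dict as a stand-in for Redis (each `cache.get` would be one cache read in the real setup), and it walks the tree with a queue rather than recursion. The names `CACHE` and `descendants` are made up for illustration.

```python
from collections import deque

# Stand-in for the cache: parent key -> list of direct children.
# In the real setup each lookup would be one read against Redis.
CACHE = {
    "E": ["F", "G"],
    "F": ["H", "I"],
    "G": ["J", "K"],
}


def descendants(start, cache):
    """Return (all descendants of start, number of cache lookups made)."""
    found = []
    lookups = 0
    queue = deque([start])
    while queue:
        node = queue.popleft()
        children = cache.get(node, [])  # one cache lookup per visited node
        lookups += 1
        found.extend(children)
        queue.extend(children)
    return found, lookups


nodes, hits = descendants("E", CACHE)
print(nodes, hits)  # ['F', 'G', 'H', 'I', 'J', 'K'] 7
```

Starting at G instead takes only 3 lookups, which matches the observation that the cost depends on where the starting node sits in the tree.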

Second Attempt

My other idea was to take the hierarchy from the database, build the tree, store it in RAM (Redis), and then pull the entire thing out of memory (it was approximately 2 MB in size, serialized). This gave me a single call into Redis (not as chatty) to pull the entire tree out, locate the user's parent node, and descend to get all child items. These calls are frequent, and passing 2 MB over the network each time seemed large. It also means I cannot easily add or remove an item without pulling down the tree, editing it, and pushing it all back. And building trees on demand over HTTP meant each request had to pull down 2 MB just to get the direct children (a very small read using the first solution).
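A rough sketch of this second approach, assuming the tree is serialized as JSON and stored under a single key. The node shape (`id`, `children`) and the function names are hypothetical; the point is only that every call deserializes the whole tree before it can find the user's node.

```python
import json

# Nested tree as it might be built from the database rows.
TREE = {"id": "E", "children": [
    {"id": "F", "children": [
        {"id": "H", "children": []},
        {"id": "I", "children": []},
    ]},
    {"id": "G", "children": [
        {"id": "J", "children": []},
        {"id": "K", "children": []},
    ]},
]}

BLOB = json.dumps(TREE)  # the single serialized value kept in the cache


def find(node, target):
    """Depth-first search for the node whose id equals target."""
    if node["id"] == target:
        return node
    for child in node["children"]:
        hit = find(child, target)
        if hit is not None:
            return hit
    return None


def flatten(node):
    """Collect the ids of every node below the given node."""
    out = []
    for child in node["children"]:
        out.append(child["id"])
        out.extend(flatten(child))
    return out


def secured_items(blob, user_node):
    tree = json.loads(blob)  # the whole tree is transferred and parsed per call
    start = find(tree, user_node)
    return flatten(start) if start is not None else []


print(secured_items(BLOB, "G"))  # ['J', 'K']
```

Even the small request for G's children pays for loading the full blob, which is the trade-off described above.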


So which solution do you think is the better approach long term, as the hierarchy continues to grow? Both are definitely faster and take some load off the database. Or is there a better way to accomplish this that I have not thought about?

Thanks

Waterboy4800 asked Nov 15 '11 23:11


2 Answers

Let me offer an idea...

Use hierarchical versioning. When a node in the graph is modified, increment its version (a simple int field in the database), but also increment versions of all of its ancestors.

  • When getting a sub-tree from the database for the first time, cache it to RAM. (You can probably optimize this through recursive CTE and do it in a single database round-trip.)
  • However, the next time you need to retrieve the same sub-tree, retrieve only the root. Then compare the cached version with the version you just fetched from the database.
    • If they match, great, you can stop fetching and just reuse the cache.
    • If they don't, fetch the children and repeat the process, refreshing the cache as you go.

The net result is that more often than not, you'll cull the fetching very early, usually after only one node, and you won't even need to cache the whole graph. Modifications are expensive, but this shouldn't be a problem since they are rare.
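The versioning idea can be sketched roughly as follows. This is only an illustration under simplifying assumptions: a dict named `DB` stands in for the database table, each `DB[node]` read counts as one fetch, and the first load fetches node by node instead of using a single recursive CTE. The names `touch` and `refresh` are invented for the example, and removing stale cache entries for deleted nodes is left out.

```python
# Hypothetical "table": node -> parent, direct children, version counter.
PARENT = {"E": None, "F": "E", "G": "E", "H": "F", "I": "F", "J": "G", "K": "G"}
DB = {n: {"parent": p, "children": [], "version": 1} for n, p in PARENT.items()}
for n, p in PARENT.items():
    if p is not None:
        DB[p]["children"].append(n)


def touch(node):
    """Record a modification: bump the node's version and every ancestor's."""
    while node is not None:
        DB[node]["version"] += 1
        node = DB[node]["parent"]


def refresh(node, cache):
    """Bring the cached sub-tree at node up to date; return rows fetched."""
    row = DB[node]  # stands in for one database fetch
    fetched = 1
    cached = cache.get(node)
    if cached is not None and cached["version"] == row["version"]:
        return fetched  # versions match: nothing below this node changed
    cache[node] = {"version": row["version"], "children": list(row["children"])}
    for child in row["children"]:
        fetched += refresh(child, cache)
    return fetched


cache = {}
first = refresh("E", cache)   # cold cache: every node is fetched
second = refresh("E", cache)  # nothing changed: only the root is fetched
touch("J")                    # modify J, which also bumps G and E
third = refresh("E", cache)   # only the changed branch is descended
print(first, second, third)   # 7 1 5
```

After the change to J, the refresh still checks F and K (one fetch each) but stops there, because their versions match the cached ones.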

BTW, a similar principle would work in the opposite direction - i.e. when you start with a leaf and need to find the path to the root. You'd need to update the versioning hierarchy in the opposite direction, but the rest should work in a very similar manner. You could even have both directions in combination.

--- EDIT ---

If your database and ADO.NET driver support it, it might be worth looking into server notifications, such as MS SQL Server's SqlDependency or OracleDependency.

Essentially, you instruct the DBMS to monitor changes and notify you when they happen. This is ideal for keeping your client-side cache up-to-date in an efficient way.

Branko Dimitrijevic answered Oct 14 '22 02:10


If the hierarchy does not change often, you can precompute the whole list of items below each node (instead of just its direct children). This needs significantly more RAM, but it will be lightning fast for any user, because you can read the whole list of descendant nodes in a single read.

For your example (I'll use JSON format):

E - {"direct": ["F", "G"], "all": ["F", "G", "H", "I", "J", "K"]}
F - {"direct": ["H", "I"], "all": ["H", "I"]}
G - {"direct": ["J", "K"], "all": ["J", "K"]}

For superusers you will still need to transfer a lot of data per request, but I don't see any way to reduce that.
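As a rough sketch, the per-node entries could be built from the direct-children map like this. The names `DIRECT` and `build_entries` are made up for the example, and the JSON serialization at the end is just one possible way to store each value.

```python
import json

# Direct children per node, as in the question's first attempt.
DIRECT = {"E": ["F", "G"], "F": ["H", "I"], "G": ["J", "K"]}


def build_entries(direct):
    """Return node -> {"direct": [...], "all": [...]} for every parent node."""
    memo = {}

    def all_below(node):
        # Memoized so each sub-tree is flattened only once.
        if node not in memo:
            kids = direct.get(node, [])
            result = list(kids)
            for child in kids:
                result.extend(all_below(child))
            memo[node] = result
        return memo[node]

    return {
        node: {"direct": list(kids), "all": all_below(node)}
        for node, kids in direct.items()
    }


entries = build_entries(DIRECT)
# One serialized value per key; a single read returns every descendant.
serialized = {node: json.dumps(entry) for node, entry in entries.items()}
print(serialized["E"])
```

The cost moves to write time: when a node is added or removed, the "all" list of every ancestor has to be rebuilt.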

mephisto123 answered Oct 14 '22 01:10