Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can two hierarchies be efficiently merged in SQL Server?

I have two tables with hierarchyid fields, one of which is a staging table with new data that needs to be merged into the other (that is, a set of nodes that need to be added to the main tree, some of which might already be there).

In addition to the hierarchyid column that defines the tree structure (parent/child relationships). each table has a separate column which contains a node identifier that uniquely identifies each node. That is, the way to tell if a node from the staging table is already in the main table is via the node ID, not via the hierarchyid columns.

Imperatively, the processing that needs to be performed would look something like this:

For each row, RS, in the staging table:
    If there is not already a row with the same Id as RS in the main table:
         Find the parent, PS, of the staging row
         Find the row, PM, in the main table that has the same node ID as PS
         Create a new child, RM of row PM
         Set PM's ID equal to the ID of RS

Importantly, this approach will only work if the tree in the staging table is sorted/traversed in breadth-first order - this is so that when RS is encountered, it is guaranteed that its parent PS already has a corresponding row in the main table.

So far the only way I can see to achieve this in SQL server is to use a cursor over the staging table (which is already sorted) and call a stored procedure for each row that essentially does exactly what is described above, complete with a SELECT MAX() to find the highest hierarchyid that already exists as a child of PM, so that the child can be added uniquely.

This is an awfully inefficient approach, though, and waaaay too slow for my purposes. Is there any better way?

For background, this is kind of a feasibility check I'm doing. I need to figure out whether I can perform this operation quickly inside SQL Server. If it turns out I can't I will have to do it another way, outside the database. The merging of the trees is inherent to (actually, in a sense kind of is) the problem domain, so structuring the data differently or taking a wider view and trying to somehow avoid performing this operation altogether is not an option.

Update

As requested, here's a concrete example.

Tables "staging" and "main" both have the same two columns:

   hierarchy_id of type hierarchyid
   node_id of type bigint

Initial contents

main:

 hierarchy_id    node_id
 /1/             1
 /1/1/           2
 /1/2/           3
 /1/3/           4

staging:

 hierarchy_id    node_id
 /1/             1
 /1/1/           3
 /1/2/           5
 /1/1/1/         6

Desired contents

main:

 hierarchy_id    node_id
 /1/             1
 /1/1/           2
 /1/2/           3
 /1/3/           4
 /1/4/           5
 /1/2/1/         6

Note that the node in the staging table with hierarchy_id /1/1/ corresponds to that with hiearchy_id /1/2/ in the target table (which is why the node_id is important - can't just copy hierarchy_id values across). Also note that the new node with node_id 6 is added as a child of the correct parent, the one with node_id 3, which is why the hierarchy_id is important - it defines the tree structure (parent/child relationships) for any new nodes. Any solution needs to take account of both aspects.

like image 878
Tom Avatar asked Aug 14 '11 17:08

Tom


1 Answers

Modeling your hierarchy in this way will lead to problems. The hierarchy_id column violates first normal form and the process for merging is going to be prone to update anomalies if you don't serialize/bottleneck access.

You should consider a table with just node_id and parent_id, see how it trivializes your merge problem

node_id   parent_id
1         NULL
2         1
3         2
4         3

node_id   parent_id
1         NULL
3         1
5         2
6         1

You would use recursive queries with this and you might be surprised how efficient the execution plans turn out. If you must have the flattened hierarchy column you can probably create an indexed view with a recursive query.

like image 60
gordy Avatar answered Sep 29 '22 21:09

gordy