I'm now in the final stages of upgrading the hierarchy design in a major transactional system, and I have been staring for a while at this 150-line query (which I'll spare you all the tedium of reading) and thinking that there has got to be a better way.
A quick summary of the question is as follows:
I found a somewhat related question, but it's really only about 20% of the answer I actually need. Here is the full scenario/specification:
- The hierarchy is implemented as hierarchyid (materialized path), indexed both breadth-first and depth-first.
- Each hierarchy node/record has a Name column which is to be queried on.
- Hierarchy queries based on the node are extremely fast, so don't worry about those.

My "dream" is to provide instant feedback to the user, as in a progressive search/filter, but I understand that this may be impossible or extremely difficult. I'd be happy with any significant improvement over the current method, which usually takes between 0.5 s and 1 s depending on the number of results.
For the sake of completeness: the existing query (stored procedure) starts by gathering all leaf nodes containing the final term, then joins upward and excludes any whose paths don't match the earlier terms. If this seems backward to anyone, rest assured, it is a great deal more efficient than starting with the roots and fanning out; that was the "old" way and could easily take several seconds per search.
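As a rough in-memory illustration of that leaf-first strategy (the tree data and the exact matching rule — query terms appearing in order along the path, not necessarily contiguously — are my assumptions, not the actual stored procedure):

```python
# Leaf-first search sketch: find nodes matching the final term, then walk
# upward and discard any whose ancestor chain doesn't contain the earlier
# terms in order. The node data here is invented for illustration.

# Each node: id -> (parent_id, name); parent_id is None for the root.
NODES = {
    1: (None, "Aaron"),
    2: (1, "House"),
    3: (1, "Cars"),
    4: (2, "Living Room"),
    5: (4, "Liquor Cabinet"),
    6: (5, "Door"),
}

def search(terms):
    last = terms[-1]
    matches = []
    # Step 1: gather all nodes matching the final term (the "leaf" end).
    for node_id, (parent_id, name) in NODES.items():
        if name != last:
            continue
        # Step 2: walk upward, requiring each earlier term to appear
        # on the path toward the root, in order.
        remaining = list(terms[:-1])
        current = parent_id
        while remaining and current is not None:
            parent, name_above = NODES[current]
            if name_above == remaining[-1]:
                remaining.pop()
            current = parent
        if not remaining:
            matches.append(node_id)
    return matches

print(search(["Aaron", "House", "Door"]))  # -> [6]
```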
So my question again: Is there a better (more efficient) way to perform this search?
I'm not necessarily looking for code, just approaches. I have considered a few possibilities but they all seem to have some problems:
One option is a NamePath column that is actually a nested sequence of strings, using a CLR type, similar to hierarchyid itself. The problem is, I have no idea how Microsoft is able to "translate" queries on this type into index operations, and I'm not even sure it's possible to do on a UDT. If the net result is just a full index scan, I've gained nothing by this approach.

It's not really the end of the world if I can't do better than what I already have; the search is "pretty fast", and nobody has complained about it. But I'm willing to bet that somebody has tackled this problem before and has some ideas. Please share them!
Take a look at Apache Lucene. You can implement very flexible yet efficient searches using Lucene. It may be useful.
Also take a look at the Search Patterns; what you are describing may fit the Faceted Search pattern.
It is quite easy to implement a trivial "Aaron House Living Door" algorithm, but I'm not sure the regular SVM/classification/entropy-based algorithms would scale to a large data set. You may also want to look at the "approximation search" concepts by Motwani and Raghavan.
Please post back what you find, if possible :-)
Hi Aaron, I have the following idea:
From your description I have the following image in my mind:
              Aaron
             /     \
         House     Cars
           |       /  \
     Living Room Ferrari Mercedes
           |
    Liquor Cabinet
      /    |    \
   Table  Door  Window
This is what your search tree might look like. Now I would sort the nodes on every level:
              Aaron
             /     \
         Cars      House
         /  \       /
        /    \     /
       /      \   /
      /        \ /
     /          X
    /          / \
   /          /   \
  |          /     \
Ferrari Living Room Mercedes
              |
       Liquor Cabinet
        /     |     \
     Door   Table  Window
Now it should be easy and fast to process a query:
Let M be the number of overall matches (the number of nodes matching the last word in the query). Then our processing time is O( (log N)^2 + M * (log N) ):
Binary search takes O(log N) time per level and there are O(log N) levels (assuming a reasonably balanced tree), so we have to spend at least O( (log N)^2 ) time. Then, for every match, we have to test whether the complete path from the matching node up to the root matches the complete query. The path has length O(log N), so given M matches overall we spend another M * O(log N) time, for a total of O( (log N)^2 + M * (log N) ).
When you have few matches, the processing time approaches O( (log N)^2 ), which is pretty good. Conversely, in the worst case (every single path matches the query, i.e. M = N), the processing time approaches O(N log N), which is not great, but also not very likely.
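The scheme above can be sketched with sorted per-level lists and Python's bisect module. The data mirrors the example tree; treating the query as an in-order (not necessarily contiguous) sequence of names on the path is my assumption about the matching rule:

```python
import bisect

# Per-level binary-search sketch: group nodes by level, sort each level by
# name, then binary-search every level for the last query word and verify
# each match's path up to the root.
PARENT = {1: None, 2: 1, 3: 1, 4: 2, 5: 3, 6: 3, 7: 4, 8: 7, 9: 7, 10: 7}
NAME = {1: "Aaron", 2: "House", 3: "Cars", 4: "Living Room", 5: "Ferrari",
        6: "Mercedes", 7: "Liquor Cabinet", 8: "Table", 9: "Door", 10: "Window"}

def level(nid):
    depth = 0
    while PARENT[nid] is not None:
        nid, depth = PARENT[nid], depth + 1
    return depth

# Build the sorted per-level index ("sort the nodes on every level").
LEVELS = {}
for nid in NAME:
    LEVELS.setdefault(level(nid), []).append((NAME[nid], nid))
for entries in LEVELS.values():
    entries.sort()

def matches_at_level(term, lvl):
    """All node ids at `lvl` named `term`, found by binary search."""
    entries = LEVELS.get(lvl, [])
    lo = bisect.bisect_left(entries, (term, -1))
    hi = bisect.bisect_right(entries, (term, float("inf")))
    return [nid for _, nid in entries[lo:hi]]

def path_matches(nid, terms):
    """Walk from `nid` to the root, consuming the query in reverse order."""
    remaining, cur = list(terms), nid
    while remaining and cur is not None:
        if NAME[cur] == remaining[-1]:
            remaining.pop()
        cur = PARENT[cur]
    return not remaining

def search(terms):
    hits = []
    for lvl in LEVELS:                    # O(log N) levels in a balanced tree
        for nid in matches_at_level(terms[-1], lvl):
            if path_matches(nid, terms):  # O(log N) check per match
                hits.append(nid)
    return sorted(hits)

print(search(["Aaron", "House", "Door"]))  # -> [9]
```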
Implementation:
You said that you only wanted an idea. Furthermore, my knowledge of databases is very limited, so I won't write much here, just outline some ideas.
The node table would contain at least ID, Level, and Text columns, and would have to be sorted by the "Text" column. Using the algorithm described above, an SQL query inside the loop might look like:

SELECT ID FROM node WHERE Level = $i AND Text LIKE $text
I hope my point comes across.
One could speed things up even more by sorting the table not only by the "Text" column but by the combined "Level" and "Text" columns; that is, all entries within Level = 20 sorted, all entries within Level = 19 sorted, etc. (no overall sorting across the complete table is necessary). However, the node count per level is O(N), so there is no asymptotic runtime improvement, but I think it's worth trying, considering the lower constants you'd get in practice.
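For a concrete (if simplified) picture, here is a hypothetical node table with a composite (Level, Text) index, sketched in SQLite via Python. Only Level and Text appear in the answer's query; the other column names are my assumptions:

```python
import sqlite3

# Hypothetical shape of the node table described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (
        ID       INTEGER PRIMARY KEY,
        ParentID INTEGER REFERENCES node(ID),
        Level    INTEGER NOT NULL,
        Text     TEXT    NOT NULL
    );
    -- Sorting "by the combined Level and Text columns" amounts to this
    -- composite index, so the per-level lookup below can run as a seek.
    CREATE INDEX ix_node_level_text ON node (Level, Text);
""")
conn.executemany(
    "INSERT INTO node (ID, ParentID, Level, Text) VALUES (?, ?, ?, ?)",
    [(1, None, 0, "Aaron"), (2, 1, 1, "House"), (3, 1, 1, "Cars"),
     (4, 2, 2, "Living Room"), (5, 3, 2, "Ferrari"), (6, 3, 2, "Mercedes")],
)

# The query from the answer, run once per level $i:
level, text = 2, "Living Room"
rows = conn.execute(
    "SELECT ID FROM node WHERE Level = ? AND Text LIKE ?", (level, text)
).fetchall()
print(rows)  # -> [(4,)]
```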
I just noticed that the iterative algorithm is completely unnecessary (so the Level information can be abandoned). It is fully sufficient to find all nodes matching the last word of the query with a single binary search over the globally sorted "Text" column, and then verify each match's path up to the root as before.
This improves the runtime to O( log N + M * (log N) ).
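A minimal sketch of this simplified version, assuming a single list sorted by name stands in for the globally sorted "Text" column (tree data invented for illustration):

```python
import bisect

# Simplified version: one globally sorted (name, id) list replaces the
# per-level loop; no Level column is needed.
PARENT = {1: None, 2: 1, 3: 2, 4: 3}
NAME = {1: "Aaron", 2: "House", 3: "Living Room", 4: "Door"}

INDEX = sorted((NAME[n], n) for n in NAME)   # one sort over all nodes

def search(terms):
    last = terms[-1]
    # One binary search finds all M nodes named `last`: O(log N).
    lo = bisect.bisect_left(INDEX, (last, -1))
    hi = bisect.bisect_right(INDEX, (last, float("inf")))
    hits = []
    for _, nid in INDEX[lo:hi]:
        remaining, cur = list(terms), nid
        while remaining and cur is not None:  # verify the path: O(log N)
            if NAME[cur] == remaining[-1]:
                remaining.pop()
            cur = PARENT[cur]
        if not remaining:
            hits.append(nid)
    return hits

print(search(["Aaron", "Door"]))  # -> [4]
```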