Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

High-Performance Hierarchical Text Search

I'm now in the final stages of upgrading the hierarchy design in a major transactional system, and I have been staring for a while at this 150-line query (which I'll spare you all the tedium of reading) and thinking that there has got to be a better way.

A quick summary of the question is as follows:

How would you implement a hierarchical search that matches several search terms at different levels in the hierarchy, optimized for fastest search time?


I found a somewhat related question, but it's really only about 20% of the answer I actually need. Here is the full scenario/specification:

  • The end goal is to find one or several arbitrary items at arbitrary positions in the hierarchy.
  • The complete hierarchy is about 80,000 nodes, projected to grow up to 1M within a few years.
  • The full text of an entire path down the hierarchy is unique and descriptive; however, the text of an individual node may not be. This is a business reality, and not a decision that was made lightly.
  • Example: a node might have a name like "Door", which is meaningless by itself, but the full context, "Aaron > House > Living Room > Liquor Cabinet > Door", has clear meaning, it describes a specific door in a specific location. (Note that this is just an example, the real design is far less trivial)
  • In order to find this specific door, a user might type "aaron liquor door", which would likely turn up only one result. The query is translated as a sequence: An item containing the text "door", under an item containing the text "liquor", under another item containing the text "aaron."
  • Or, a user might just type "house liquor" to list all the liquor cabinets in people's houses (wouldn't that be nice). I mention this example explicitly to indicate that the search need not match any particular root or leaf level. This user knows exactly which door he is looking for, but can't remember offhand who owns it, and would remember if the name popped up in front of him.
  • All terms must be matched in the specified sequence, but as the above examples suggest, levels in the hierarchy can be "skipped." The term "aaron booze cabinet" would not match this node.
  • The platform is SQL Server 2008, but I believe that this is a platform-independent problem and would prefer not to restrict answers to that platform.
  • The hierarchy itself is based on hierarchyid (materialized path), indexed both breadth-first and depth-first. Each hierarchy node/record has a Name column which is to be queried on. Hierarchy queries based on the node are extremely fast, so don't worry about those.
  • There is no strict hierarchy - a root may contain no nodes at all or may contain 30 subtrees fanning out to 10,000 leaf nodes.
  • The maximum nesting is arbitrary, but in practice it tends to be no more than 4-8 levels.
  • The hierarchy can and does change, although infrequently. Any node can be moved to any other node, with the obvious exceptions (parent can't be moved into its own child, etc.)
  • In case this wasn't already implied: I do have control over the design and can add indexes, fields, tables, whatever might be necessary to get the best results.

My "dream" is to provide instant feedback to the user, as in a progressive search/filter, but I understand that this may be impossible or extremely difficult. I'd be happy with any significant improvement over the current method, which usually takes between 0.5s to 1s depending on the number of results.

For the sake of completeness, the existing query (stored procedure) starts by gathering all leaf nodes containing the final term, then joins upward and excludes any whose paths don't match with the earlier terms. If this seems backward to anyone, rest assured, it is a great deal more efficient than starting with the roots and fanning out. That was the "old" way and could easily take several seconds per search.

So my question again: Is there a better (more efficient) way to perform this search?

I'm not necessarily looking for code, just approaches. I have considered a few possibilities but they all seem to have some problems:

  • Create a delimited "path text" column and index it with Full-Text Search. The trouble is that a search on this column would return all child nodes as well; "aaron house" also matches "aaron house kitchen" and "aaron house basement".
  • Created a NamePath column that is actually a nested sequence of strings, using a CLR type, similar to hierarchyid itself. Problem is, I have no idea how Microsoft is able to "translate" queries on this type to index operations, and I'm not even sure if it's possible to do on a UDT. If the net result is just a full index scan, I've gained nothing by this approach.

It's not really the end of the world if I can't do better than what I already have; the search is "pretty fast", nobody has complained about it. But I'm willing to bet that somebody has tackled this problem before and has some ideas. Please share them!

like image 422
Aaronaught Avatar asked Dec 30 '09 23:12

Aaronaught


2 Answers

take a look at Apache Lucene. You can implement very flexible yet efficient searches using Lucene. It may be useful.

Also take a look at the Search Patterns - What you are describing may fit into the Faceted Search pattern.

It is quite easy implement a trivial "Aaron House Living Door" algorithm, but not sure the regular SVM/classification/entropy based algorithms would scale to a large data set. You may also want to look at the "approximation search" concepts by Motwani and Raghavan.

Please post back what you find, if possible :-)

like image 156
srini.venigalla Avatar answered Sep 22 '22 13:09

srini.venigalla


Hi Aarron, I have the following idea:
From your description I have the following image in my mind:

          Aaron
         /     \
        /       \
       /         \
  House           Cars
    |            /    \
Living Room   Ferrari  Mercedes
    |
Liquor Cabinet
    /    |    \
Table   Door   Window

This is how your search tree might look like. Now I would sort the nodes on every level:

               Aaron
              /     \
             /       \
            /         \
         Cars         House
         /   \       /
        /     \     /
       /       \   /
      /         \ /
     /           X
    /           / \
   /           /   \
  /           /     \
 /           /       \
|           /         \
|          /           \ 
Ferrari   Living Room   Mercedes
                        |
                  Liquor Cabinet
                   /    |    \
               Door   Table   Window

Now it should be easy and fast to process a query:

  1. Start with the last word in the query and the lowest node level(leafs)
  2. Since all the nodes are sorted within one level, You can use binary search and therefore find a match in O(log N) time, where N is the node count.
  3. Do this for every level. There are O(log N) levels in the tree.
  4. Once You find a match, process all parent nodes to see, if the path matches your query. The path has length O(log N). If it matches, store it in the results, that should be shown to the user.

Let be M the number of overall matches (number of nodes matching the last word in the query). Then our processing time is: O( (log N)^2 + M * (log N) ):
Binary search takes O(log N) time per level and there are O(log N) levels, therefore we have to spend at least O( (log N)^2 ) time. Now, for every match, we have to test, whether the complete path from our matching node up to the root matches the complete query. The path has length O(log N). Thus, given M matches overall, we spend another M * O(log N) time, thus the resulting execution time is O( (log N)^2 + M * (log N) ).

When You have few matches, the processing time approaches O( (log N)^2 ), which is pretty good. On the opposite if the worst case occurs (every single path matches the query (M = N)), the processing time approaches O(N log N) which is not too good, but also not too likely.

Implementation: You said, that You only wanted an idea. Further my knowledge on databases is very limited, so I won't write much here, just outline some ideas.
The node table could look like this:

  • ID : int
  • Text : string
  • Parent : int -> node ID
  • Level : int //I don't expect this to change too often, so You can save it and update it, as the database changes.

This table would have to be sorted by the "Text" column. Using the algorithm described above a sql query inside the loop might look like:
SELECT ID FROM node WHERE Level = $i AND Text LIKE $text
Hope one can get my point.

One could speed things even more up, by not only sorting the table by the "Text" column, but by the combined "Text" and "Level" columns, that is, all entries within Level=20 sorted, all entries within Level=19 sorted etc.(but no overall sorting over the complete table necessary). However, the node count PER LEVEL is in O(N), so there is no asymptotic runtime improvement, but I think it's worth to try out, considering the lower constants You get in reality.

Edit: Improvement

I just noticed, that the iterative algorithm is completely unnecessary(thus the Level information can be abandoned). It is fully sufficient to:

  1. Store all nodes sorted by their text value
  2. Find all matches for the last word of the query at once using binary search over all nodes.
  3. From every match, trace the path up to the root and check if the path matches the whole query.

This improves the runtime to O(log N + M * (log N)).

like image 42
Dave O. Avatar answered Sep 22 '22 13:09

Dave O.