Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to match a tree against a large set of patterns?

I have a potentially infinite set of symbols: A, B, C, ... There is also a distinct special placeholder symbol ? (its meaning will be explained below).

Consider non-empty finite trees such that every node has a symbol attached to it and 0 or more non-empty sub-trees. The order of sub-trees of a given node is significant (so, for example, if there is a node with 2 sub-trees, we can distinguish which one is left and which one is right). Any given symbol can appear in the tree 0 of more times attached to different nodes. The placeholder symbol ? can be attached only to leaf nodes (i.e. nodes having no sub-trees). It follows from the usual definition of a tree that trees are acyclic.

The finiteness requirement means that the total number of nodes in a tree is a positive finite integer. It follows that the total number of attached symbols, the tree depth and the total number of nodes in every sub-tree are all finite.

Trees are given in a functional notation: a node is represented by a symbol attached to it and, if there are any sub-trees, it is followed by parentheses containing comma-separated list of sub-trees represented recursively in the same way. So, for example the tree

                    A
                   / \
                  ?   B
                     / \
                    A   C
                   /|\
                  A C Q
                       \
                        ?

is represented as A(?,B(A(A,C,Q(?)),C)).

I have a pre-calculated unchanging set of trees S that will be used as patterns to match. The set will typically have ~ 105 trees, and every its element will typically have ~ 10-30 nodes. I can use a plenty of time to create beforehand any representation of S that will best suit my problem stated below.

I need to write a function that accepts a tree T (typically with ~ 102 nodes) and checks as fast as possible if T contains as a subtree any element of S, provided that any node with placeholder symbol ? matches any non-empty subtree (both when it appears in T or in an element of S).

Please suggest a data structure to store the set S and an algorithm to check for a match. Any programing language or a pseudo-code is OK.

like image 596
TauMu Avatar asked Jun 16 '13 03:06

TauMu


2 Answers

This paper describes a variant of the Aho–Corasick algorithm, where instead of using a finite state machine (which the standard Aho–Corasick algorithm uses for string matching) the algorithm instead uses a pushdown automaton for subtree matching. Like the Aho-Corasick string-matching algorithm, their variant only requires one pass through the input tree to match against the entire dictionary of S.

The paper is quite complex - it may be worth it to contact the author to see if he has any source code available.

like image 139
Zim-Zam O'Pootertoot Avatar answered Oct 18 '22 07:10

Zim-Zam O'Pootertoot


What you need is a finite state machine that tracks the set of potential matches you might have.

In essence, such a machine is is the result of matching the patterns against each other, and determining what part of the individual matches they share. This is analogous to how lexers take sets of regular expressions for tokens and compose them into a large FSA that can match any of the regular expressions by processing characters one at a time.

You can find references to methods for doing this under term rewriting systems.

like image 35
Ira Baxter Avatar answered Oct 18 '22 07:10

Ira Baxter