Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How and when to create a suffix link in suffix tree?

Could anyone give me an example about how and when to create a suffix link in suffix tree?

If my string is ABABABC, but do use a different example if that is better.

Hope to give some pictures to illustrate every step.

very appreciate.

like image 919
lingguang1997 Avatar asked Apr 16 '12 02:04

lingguang1997


People also ask

What is suffix link in suffix tree?

Suffix linksIf there is a node in the tree with a label , where is a character and is a string (non-empty), then the suffix link of points to , which is a node with label . If is empty, then the suffix link of , i.e, is the root. Suffix links exist for every internal (non-leaf) node of a suffix tree.

How do you make a suffix tree?

We build a suffix tree by following each suffix and creating an edge for each character, starting with a top node. If the new suffix to be put in the tree begins with a set of characters that are already in the tree, we follow those characters until we have a different one, creating a new branch.

How do suffix trees work?

In computer science, a suffix tree (also called PAT tree or, in an earlier form, position tree) is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values. Suffix trees allow particularly fast implementations of many important string operations.

How much time does a suffix tree takes to construct?

It allows fast string operation. Total time taken for construction of suffix tree is linear to the length of the tree. 4.


1 Answers

To understand this, first recall that there are three kinds of nodes in a suffix tree:

  • The root
  • Internal nodes
  • Leaf nodes

In the graph below, which is the suffix tree for ABABABC, the yellow circle is the root, the grey, blue and green ones are internal nodes, and the small black ones are leaves.

There are two important things to notice:

  • Internal nodes always have more than 1 outgoing edge. That is, internal nodes mark those parts of the tree where branching occurs.

  • Branching occurs wherever a repeated string is involved, and only there. For any internal node X, the string leading from the root to X must have occurred in the input string at least as many times as there are outgoing edges from X.

Example: The string leading to the blue node is ABAB. Indeed, this string appears twice in the input string: At position 0 and at position 2. And that is why the blue node exists.

Now about suffix links:

  1. If the string s leading up to some internal node X is longer than 1 character, the same string minus the first character (call this s-1) must be in the tree, too (it's a suffix tree, after all, so the suffix of any of its strings must be in the tree, too).

    Example: Let s=ABAB, the string leading to the blue node. Then after removing the first character, s-1 is BAB. And indeed that string is found in the tree, too. It leads to the green node.

  2. If some string s leads to an internal node, its shortened version s-1must lead to an internal node (call it X-1) as well. Why? Because s must appear at least twice in the input string, so s-1 must appear at least as many times (because it is part of s: wherever s appears, s-1 must appear, too). But if s-1 appears multiple times in the input string, then there must be an internal node for it.

  3. In any such situation, a special link connecting X to X-1 is a suffix link.

Note: Because of (1) and (2) above, every internal node X that has a label from root to X of more than 1 character must have a suffix link to exactly one other internal node.

Example:

This is the same suffix tree as before; the dotted lines indicate the suffix links. If you start at the blue node and follow the suffix links from there (from blue, to green, to first gray, to second gray), and look at the strings leading from the root to each node, you will see this:

 ABAB   ->    BAB    ->    AB    ->    B
(blue)      (green)     (gray1)     (gray2)

This is why they are called suffix links (the entire sequence is called suffix chain).

What are they good for?

They are good for surprisingly many things. However, they play a particular role in Ukkonen's algorithm for suffix tree construction, specifically in Rule 3 described there: After inserting a the final character of some suffix s at some point X, the algorithm needs to insert the final character of suffix s-1 in O(1) time. In order to do that, it uses the suffix link to jump right to the place X-1 and makes the insert.

But, note that there is no necessity to put suffix links in a suffix tree. They are not part of the definition of a suffix tree — they are just special links used by some algorithms that construct or use suffix trees.


Regarding single-character nodes: What if there is an internal node X whose string (i.e. the string on the path from root to X) consists of only one character? By the definition above, X then does not have a suffix link. You can however assume that if it had a suffix link, it would point to the root node. Furthermore: If, by the definition above, an internal node does not have a suffix link, it must be a single-character node, so you can always assume that if no suffix link is present at an internal node it must be a single-character node, and therefore, the node that represents the s-1 suffix is the root node. (Some algorithms may actually add an explicit suffix link pointing to the root node in this case.) Thanks to j_random_hacker for the comment about this.

like image 97
jogojapan Avatar answered Nov 02 '22 23:11

jogojapan