Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it possible to count the number of distinct substrings in a string in O(n)?

Given a string s of length n, is it possible to count the number of distinct substrings in s in O(n)?

Example

Input: abb

Output: 5 ('abb', 'ab', 'bb', 'a', 'b')

I have done some research but i can't seem to find an algorithm that solves this problem in such an efficient way. I know a O(n^2) approach is possible, but is there a more efficient algorithm?

I don't need to obtain each of the substrings, just the total number of distinct ones (in case it makes a difference).

like image 462
donrondon Avatar asked Jan 19 '16 17:01

donrondon


People also ask

How do you find the number of distinct substrings?

Number of substrings of a string of length n is (n * (n + 1) / 2).

How many substrings does a string of length n have?

Approach: The count of sub-strings of length n will always be len – n + 1 where len is the length of the given string.

How many substrings are possible?

A substring is completely defined by two parameters, such as start and length, in the range [1:n], so there are no more than n^2 possibilities. A general, non-contiguous, subset needs n true-false decisions to specify it, so there are 2^n possibilities.


2 Answers

You can use Ukkonen's algorithm to build a suffix tree in linear time:

https://en.wikipedia.org/wiki/Ukkonen%27s_algorithm

The number of substrings of s is then the number of prefixes of strings in the trie, which you can calculate simply in linear time. It's just total number of characters in all nodes.

For instance, your example produces a suffix tree like:

            /\                
           b  a
           |  b
           b  b

5 characters in the tree, so 5 substrings. Each unique string is a path from the root ending after a different letter: abb, ab, a, bb, b. So the number of strings is the number of letters in the tree.

More precisely:

  • Every substring is the prefix of some suffix of the string;
  • All the suffixes are in the trie;
  • So there is a 1-1 correspondence between substrings and paths through the trie (by the definition of trie); and
  • There is a 1-1 correspondence between letters in the tree and non-empty paths, because:
    • each distinct non-empty path ends at a distinct position after its last letter; and
    • the path to the the position following each letter is unique

NOTE for people who are wondering how it could be possible to build a tree that contains O(N^2) characters in O(N) time:

There's a trick to the representation of a suffix tree. Instead of storing the actual strings in the nodes of the tree, you just store pointers into the orignal string, so the node that contains "abb" doesn't have "abb", it has (0,3) -- 2 integers per node, regardless of how long the string in each node is, and the suffix tree has O(N) nodes.

like image 163
Matt Timmermans Avatar answered Sep 27 '22 18:09

Matt Timmermans


Construct the LCP array and subtract its sum from the number of substrings (n(n+1)/2).

like image 27
David Eisenstat Avatar answered Sep 27 '22 17:09

David Eisenstat