This is an interview question. Given a number of strings find such strings, which are prefixes of others. For example, given strings = {"a", "aa", "ab", abb"}
the result is {"a", "ab"}
.
The simplest solution is just to sort the strings and check each pair of two subsequent strings if the 1st one is a prefix of the 2nd one. The running time of the algorithm is the running time of the sorting.
I guess there is another solution, which uses a trie
, and has complexity O(N)
, where N is the number of strings. Could you suggest such an algorithm?
I have a following idea regarding Trie, complexity O(N): You start with empty Trie. You take words one by one, and add word to Trie. After you add a word (let's call it word Wi) to Trie, there are two cases to consider:
In more details (pseudocode):
for word in words
add word to trie
if size of trie did not change then // first case
add word to result
if ending nodes found while adding word // second case
add words defined by those nodes to result
return result
Adding new word to Trie:
node = trie.root();
for letter in word
if node.hasChild(letter) == false then // if letter doesnt exist, add it
node.addChild(letter)
if letter is last_letter_of_word then // if last letter of word, store that info
node.setIsLastLetterOf(word)
node = node.getChild(letter) // move
While you are adding new word, you can also check if you passed through any nodes that represent last letters of other words. Complexity of algorithm that I described is O(N). Another important thing is that this way you can know how many times word Wi prefixes other words, which may be useful.
Example for {aab, aaba, aa}: Green nodes are nodes detected as case 1. Red nodes are nodes detected as case 2. Each column(trie) is one step. At the beginning trie is empty. Black arrows show which nodes we visited(added) in that step. Nodes that represent last letter of some word have that word written in parenthesess.
At the end we have result = {aab, aa} which is correct.
The original answer is correct for: is a string a
a substring of b
(misread).
Using a trie, you can simply add all strings to it in a first iteration, and in the 2nd iteration, start reading each word, let it be w
. If you find a word that you finished your read, but did not reach the string terminator ($
usually), you reach some node v
in the trie.
By doing a DFS from v
, you can get all strings which w
is prefix of them.
high level pseudo code:
t <- new trie
for each word w:
t.add(w)
for each word w:
node <- t.getLastNode(w)
if node.val != $
collection<- DFS(node) (excluding w itself)
w is a prefix of each word in collection
Note: in order to optimize it, you might need to do some extra work: if a
is prefix of b
, and b
is prefix of c
, then a
is prefix of c
, so - when you do the DFS, if you reach some node that was already searched - just append its strings to the current prefix.
Still, since there could be quadric number of possibilities ("a", "aa", "aaa", ....
), getting all of them requires quadric time.
Original answer: finding if a
is a substring of b
:
The suggested solution runs in a quadric complexity, you will need to check each two pairs, giving you O(n* (n-1) * |S|)
.
You can build a suffix tree from the strings in the first iteration, and in the 2nd iteration check if each string is a non trivial entry (not itself) of another string.
This solution is O(n*|S|)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With