The text of Alice in Wonderland contains the word 'Wonderland' 8 times. (Let's be case-insensitive for this question).
However it contains the word many more times if you count non-contiguous subsequences as well as substrings, eg.
Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to WONDER what was going to happen next. First, she tried to Look down AND make out what she was coming to, but it was too dark to see anything;
(A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. —Wikipedia)
How many times does the book contain the word Wonderland as a subsequence? I expect this will be a big number—it's a long book with many w's and o's and n's and d's.
I tried brute force counting (recursion to make a loop 10 deep) but it was too slow, even for that example paragraph.
Let's say you didn't want to search for wonderland
, but just for w
. Then you'd simply count how many times w
occurred in the story.
Now let's say you want wo
. For each first character of the current pattern you find, you add to your count:
How many times the current pattern without its first character occurs in the rest of the story, after this character you're at: so you have reduced the problem (story[1..n], pattern[1..n])
to (story[2..n], pattern[2..n])
How many times the entire current pattern occurs in the rest of the story. So you have reduced the problem to (story[2..n], pattern[1..n])
Now you can just add the two. There is no overcounting if we talk in terms of subproblems. Consider the example wawo
. Obviously, wo
occurs 2
times. You might think the counting will go like:
For the first w
, add 1
because o
occurs once after it and another 1
because wo
occurs once after it.
For the second w
, add 1
because o
occurs once after it.
Answer is 3
, which is wrong.
But this is what actually happens:
(wawo, wo) -> (awo, o) -> (wo, o) -> (o, o) -> (-, -) -> 1
-> (-, o) -> 0
-> (awo, wo) -> (wo, wo) -> (o, wo) -> (-, wo) -> 0
-> (o, o) -> (-, -) -> 1
-> (-, o) -> 0
So you can see that the answer is 2
.
If you don't find a w
, then the count for this position is just how many times wo
occurs after this current character.
This allows for dynamic programming with memoization:
count(story_index, pattern_index, dp):
if dp[story_index, pattern_index] not computed:
if pattern_index == len(pattern):
return 1
if story_index == len(story):
return 0
if story[story_index] == pattern[pattern_index]:
dp[story_index, pattern_index] = count(story_index + 1, pattern_index + 1, dp) +
count(story_index + 1, pattern_index, dp)
else:
dp[story_index, pattern_index] = count(story_index + 1, pattern_index, dp)
return dp[story_index, pattern_index]
Call with count(0, 0, dp)
. Note that you can make the code cleaner (remove the duplicate function call).
Python code, with no memoization:
def count(story, pattern):
if len(pattern) == 0:
return 1
if len(story) == 0:
return 0
s = count(story[1:], pattern)
if story[0] == pattern[0]:
s += count(story[1:], pattern[1:])
return s
print(count('wonderlandwonderland', 'wonderland'))
Output:
17
This makes sense: for each i
first characters in the first wonderland
of the story, you can group it with remaining final characters in the second wonderland
, giving you 10
solutions. Another 2
are the words themselves. The other five are:
wonderlandwonderland
********* *
******** **
******** * *
** ** ******
*** * ******
You're right that this will be a huge number. I suggest that you either use large integers or take the result modulo something.
The same program returns 9624
for your example paragraph.
The string "wonderland" occurs as a subsequence in Alice in Wonderland124100772180603281661684131458232 times.
The main idea is to scan the main text character by character, keeping a running count of how often each prefix of the target string (i.e.: in this case, "w", "wo", "won", ..., "wonderlan", and "wonderland") has occurred up to the current letter. These running counts are easy to compute and update. If the current letter does not occur in "wonderland", then the counts are left untouched. If the current letter is "a" then we increment the count of "wonderla"s seen by the number of "wonderl"s seen up to this point. If the current letter is "n" then we increment the count of "won"s by the count of "wo"s and the count of "wonderlan"s by the count of "wonderla"s. And so forth. When we reach end of the text, we will have the count of all prefixes of "wonderland" including the string "wonderland" itself, as desired.
The advantage of this approach is that it requires a single pass through the text and does not require O(n) recursive calls (which will likely exceed the maximum recursion depth unless you do something clever).
import fileinput
import string
target = 'wonderland'
prefixes = dict()
count = dict()
for i in range(len(target)) :
letter = target[i]
prefix = target[:i+1]
if letter not in prefixes :
prefixes[letter] = [prefix]
else :
prefixes[letter].append(prefix)
count[prefix] = 0L
for line in fileinput.input() :
for letter in line.lower() :
if letter in prefixes :
for prefix in prefixes[letter] :
if len(prefix) > 1 :
count[prefix] = count[prefix] + count[prefix[:len(prefix)-1]]
else:
count[prefix] = count[prefix] + 1
print count[target]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With