Counting in Wonderland

Question

The text of Alice in Wonderland contains the word 'Wonderland' 8 times. (Let's be case-insensitive for this question).

However, it contains the word many more times if you count non-contiguous subsequences as well as substrings, e.g.:

Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to WONDER what was going to happen next. First, she tried to Look down AND make out what she was coming to, but it was too dark to see anything;

(A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. —Wikipedia)

How many times does the book contain the word Wonderland as a subsequence? I expect this will be a big number—it's a long book with many w's and o's and n's and d's.

I tried brute force counting (recursion to make a loop 10 deep) but it was too slow, even for that example paragraph.

IVlad · Accepted Answer

Let's say you didn't want to search for wonderland, but just for w. Then you'd simply count how many times w occurred in the story.

Now let's say you want wo. Each time you find the first character of the current pattern in the story, you add to your count:

How many times the current pattern without its first character occurs in the rest of the story, after the character you're at. This reduces the problem (story[1..n], pattern[1..m]) to (story[2..n], pattern[2..m]).
How many times the entire current pattern occurs in the rest of the story. This reduces the problem to (story[2..n], pattern[1..m]).

Now you can just add the two. There is no overcounting if we talk in terms of subproblems. Consider the example wawo. Obviously, wo occurs 2 times. You might think the counting will go like:

For the first w, add 1 because o occurs once after it and another 1 because wo occurs once after it.
For the second w, add 1 because o occurs once after it.
Answer is 3, which is wrong.

But this is what actually happens:

(wawo, wo) -> (awo, o) -> (wo, o) -> (o, o) -> (-, -) -> 1
                                            -> (-, o) -> 0
           -> (awo, wo) -> (wo, wo) -> (o, wo) -> (-, wo) -> 0
                                    -> (o, o) -> (-, -) -> 1
                                              -> (-, o) -> 0

So you can see that the answer is 2.

If you don't find a w, then the count for this position is just how many times wo occurs after this current character.

This allows for dynamic programming with memoization:

# story and pattern are module-level strings; dp is a dict used as the memo table
def count(story_index, pattern_index, dp):
  if (story_index, pattern_index) not in dp:
    if pattern_index == len(pattern):
      return 1  # the whole pattern has been matched
    if story_index == len(story):
      return 0  # the story ran out before the pattern did

    if story[story_index] == pattern[pattern_index]:
      # either use this character for the pattern, or skip it
      dp[story_index, pattern_index] = (count(story_index + 1, pattern_index + 1, dp) +
                                        count(story_index + 1, pattern_index, dp))
    else:
      dp[story_index, pattern_index] = count(story_index + 1, pattern_index, dp)

  return dp[story_index, pattern_index]

Call with count(0, 0, dp), starting from an empty dp table. Note that you can make the code cleaner by removing the duplicated call to count(story_index + 1, pattern_index, dp).
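A minimal sketch of that cleanup, assuming Python 3 and using functools.lru_cache in place of the explicit dp table. The names count_memo and go are made up for this illustration. The "skip" call is made once, and the "match" call is added only when the characters agree:

```python
from functools import lru_cache

def count_memo(story, pattern):
  @lru_cache(maxsize=None)
  def go(i, j):
    if j == len(pattern):
      return 1  # the whole pattern has been matched
    if i == len(story):
      return 0  # the story ran out before the pattern did
    total = go(i + 1, j)  # skip story[i]
    if story[i] == pattern[j]:
      total += go(i + 1, j + 1)  # use story[i] for pattern[j]
    return total

  return go(0, 0)

print(count_memo('wawo', 'wo'))  # 2
print(count_memo('wonderlandwonderland', 'wonderland'))  # 17
```

The recursion still goes one level deeper per story character, so this sketch is only suitable for short inputs.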

Python code, with no memoization:

def count(story, pattern):
  if len(pattern) == 0:
    return 1
  if len(story) == 0:
    return 0

  s = count(story[1:], pattern)
  if story[0] == pattern[0]:
    s += count(story[1:], pattern[1:])

  return s

print(count('wonderlandwonderland', 'wonderland'))

Output:

17

This makes sense: for each i, you can group the first i characters of the first wonderland in the story with the remaining final characters of the second wonderland, giving you 10 solutions. Another 2 are the words themselves. The other five are:

wonderlandwonderland
*********    *
********    **
********    *      *
**      **    ******
***      *    ******

You're right that this will be a huge number. I suggest that you either use large integers or take the result modulo something.
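Python integers are arbitrary precision, so the programs here never overflow. In a language with fixed-width integers, the modulo option could look like the sketch below. It fills a table bottom-up instead of recursing, and the modulus 10**9 + 7 and the name count_mod are arbitrary choices for illustration, not something from the question:

```python
def count_mod(story, pattern, mod=10**9 + 7):
  # ways[j] = number of ways to match pattern[:j] in the story read so far, modulo mod
  ways = [0] * (len(pattern) + 1)
  ways[0] = 1  # the empty pattern matches exactly once
  for ch in story:
    # go right to left so each story character only extends counts from before it
    for j in range(len(pattern), 0, -1):
      if pattern[j - 1] == ch:
        ways[j] = (ways[j] + ways[j - 1]) % mod
  return ways[len(pattern)]

print(count_mod('wonderlandwonderland', 'wonderland'))  # 17
print(count_mod('wonderlandwonderland', 'wonderland', mod=5))  # 2
```

Because the count is built up only by additions, reducing after every step gives the same result as reducing the exact count at the end.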

The same program returns 9624 for your example paragraph.

mhum · Answer

The string "wonderland" occurs as a subsequence in Alice in Wonderland¹ 24100772180603281661684131458232 times.

The main idea is to scan the main text character by character, keeping a running count of how often each prefix of the target string (in this case "w", "wo", "won", ..., "wonderlan", and "wonderland") has occurred up to the current letter. These running counts are easy to compute and update. If the current letter does not occur in "wonderland", the counts are left untouched. If the current letter is "a", we increment the count of "wonderla"s seen by the number of "wonderl"s seen up to this point. If the current letter is "n", we increment the count of "won"s by the count of "wo"s, and the count of "wonderlan"s by the count of "wonderla"s. And so forth. When we reach the end of the text, we will have the counts of all prefixes of "wonderland", including the string "wonderland" itself, as desired.

The advantage of this approach is that it requires a single pass through the text and does not require recursion O(n) levels deep (which will likely exceed the maximum recursion depth unless you do something clever).
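To watch the running counts evolve on a tiny input, here is a small sketch that records them after every character. The helper name prefix_counts and the example text 'wawo' with target 'wo' are illustrative choices. It walks the prefixes from longest to shortest, a precaution added here so that a target with a doubled letter is not extended twice by a single character:

```python
def prefix_counts(text, target):
  # counts[k] = how many times target[:k+1] has occurred as a subsequence so far
  counts = [0] * len(target)
  history = []
  for letter in text.lower():
    for k in range(len(target) - 1, -1, -1):
      if target[k] == letter:
        # a one-letter prefix gains 1; a longer one gains the count of the prefix before it
        counts[k] += counts[k - 1] if k > 0 else 1
    history.append((letter, list(counts)))
  return history

for letter, counts in prefix_counts('wawo', 'wo'):
  print(letter, counts)
# w [1, 0]
# a [1, 0]
# w [2, 0]
# o [2, 2]
```

The last entry of the final row is the answer for the full target: 2 for 'wo' in 'wawo'.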

Code

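import fileinput

target = 'wonderland'

# prefixes[letter] lists every prefix of target that ends in that letter
prefixes = dict()
# count[prefix] is the number of times that prefix has occurred so far
count = dict()

for i in range(len(target)):
    letter = target[i]
    prefix = target[:i+1]
    if letter not in prefixes:
        prefixes[letter] = [prefix]
    else:
        prefixes[letter].append(prefix)
    count[prefix] = 0

# reads the text from the files named on the command line, or from stdin
for line in fileinput.input():
    for letter in line.lower():
        if letter in prefixes:
            # longest prefix first, so a doubled letter in target is not counted twice
            for prefix in reversed(prefixes[letter]):
                if len(prefix) > 1:
                    count[prefix] = count[prefix] + count[prefix[:-1]]
                else:
                    count[prefix] = count[prefix] + 1

print(count[target])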

¹ Using this text from Project Gutenberg, starting with "CHAPTER I. Down the Rabbit-Hole" and ending with "THE END".
