Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are there any cleverly efficient algorithms to perform a calculation over the space of partitionings of a string?

I'm working on a statistical project that involves iterating over every possible way to partition a collection of strings and running a simple calculation on each. Specifically, each possible substring has a probability associated with it, and I'm trying to get the sum across all partitions of the product of the substring probability in the partition.

For example, if the string is 'abc', then there would be probabilities for 'a', 'b', 'c', 'ab, 'bc' and 'abc'. There are four possible partitionings of the string: 'abc', 'ab|c', 'a|bc' and 'a|b|c'. The algorithm needs to find the product of the component probabilities for each partitioning, then sum the four resultant numbers.

Currently, I've written a python iterator that uses binary representations of integers for the partitions (eg 00, 01, 10, 11 for the example above) and simply runs through the integers. Unfortunately, this is immensely slow for strings longer than 20 or so characters.

Can anybody think of a clever way to perform this operation without simply running through every partition one at a time? I've been stuck on this for days now.

In response to some comments here is some more information:
The string can be just about anything, eg "foobar(foo2)" -- our alphabet is lowercase alphanumeric plus all three type of braces ("(","[","{"), hyphens and spaces.
The goal is to get the likelihood of the string given individual 'word' likelihoods. So L(S='abc')=P('abc') + P('ab')P('c') + P('a')P('bc') + P('a')P('b')P('c') (Here "P('abc')" indicates the probability of the 'word' 'abc', while "L(S='abc')" is the statistical likelihood of observing the string 'abc').

like image 456
Peter M Avatar asked Dec 28 '25 22:12

Peter M


1 Answers

A Dynamic Programming solution (if I understood the question right):

def dynProgSolution(text, probs):
  probUpTo = [1]
  for i in range(1, len(text)+1):
    cur = sum(v*probs[text[k:i]] for k, v in enumerate(probUpTo))
    probUpTo.append(cur)
  return probUpTo[-1]

print dynProgSolution(
  'abc',
  {'a': 0.1, 'b': 0.2, 'c': 0.3,
   'ab': 0.4, 'bc': 0.5, 'abc': 0.6}
  )

The complexity is O(N2) so it will easily solve the problem for N=20.

How why does this work:

  • Everything you will multiply by probs['a']*probs['b'] you will also multiply by probs['ab']
  • Thanks to the Distributive Property of multiplication and addition, you can sum those two together and multiply this single sum by all of its continuations.
  • For every possible last substring, it adds the sum of all splits ending with that by adding its probability multiplied by the sum of all probabilities of previous paths. (alternative phrasing would be appreciated. my python is better than my english..)
like image 95
yairchu Avatar answered Dec 31 '25 10:12

yairchu