Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Read all possible sequential substrings in Python

Tags:

python

If I have a list of letters, such as:
word = ['W','I','N','E']
and need to get every possible sequence of substrings, of length 3 or less, e.g.:
W I N E, WI N E, WI NE, W IN E, WIN E etc.
What is the most efficient way to go about this?

Right now, I have:

word = ['W','I','N','E']
for idx,phon in enumerate(word):
    phon_seq = ""
    for p_len in range(3):
        if idx-p_len >= 0:
            phon_seq = " ".join(word[idx-(p_len):idx+1])
            print(phon_seq)

This just gives me the below, rather than the sub-sequences:

W
I
W I
N
I N
W I N
E
N E
I N E

I just can't figure out how to create every possible sequence.

like image 797
Adam_G Avatar asked Nov 06 '14 23:11

Adam_G


3 Answers

Try this recursive algorithm:

def segment(word):
  def sub(w):
    if len(w) == 0:
      yield []
    for i in xrange(1, min(4, len(w) + 1)):
      for s in sub(w[i:]):
        yield [''.join(w[:i])] + s
  return list(sub(word))

# And if you want a list of strings:
def str_segment(word):
  return [' '.join(w) for w in segment(word)]

Output:

>>> segment(word)
[['W', 'I', 'N', 'E'], ['W', 'I', 'NE'], ['W', 'IN', 'E'], ['W', 'INE'], ['WI', 'N', 'E'], ['WI', 'NE'], ['WIN', 'E']]

>>> str_segment(word)
['W I N E', 'W I NE', 'W IN E', 'W INE', 'WI N E', 'WI NE', 'WIN E']
like image 130
irrelephant Avatar answered Nov 14 '22 23:11

irrelephant


As there can either be a space or not in each of three positions (after W, after I and after N), you can think of this as similar to bits being 1 or 0 in a binary representation of a number ranging from 1 to 2^3 - 1.

input_word = "WINE"
for variation_number in xrange(1, 2 ** (len(input_word) - 1)):  
    output = ''
    for position, letter in enumerate(input_word):
        output += letter
        if variation_number >> position & 1:
            output += ' '
    print output

Edit: To include only variations with sequences of 3 characters or less (in the general case where input_word may be longer than 4 characters), we can exclude cases where the binary representation contains 3 zeroes in a row. (We also start the range from a higher number in order to exclude the cases which would have 000 at the beginning.)

for variation_number in xrange(2 ** (len(input_word) - 4), 2 ** (len(input_word) - 1)):  
    if not '000' in bin(variation_number):
        output = ''
        for position, letter in enumerate(input_word):
            output += letter
            if variation_number >> position & 1:
                output += ' '
        print output
like image 33
Stuart Avatar answered Nov 14 '22 21:11

Stuart


My implementation for this problem.

#!/usr/bin/env python

# this is a problem of fitting partitions in the word
# we'll use itertools to generate these partitions
import itertools

word = 'WINE'

# this loop generates all possible partitions COUNTS (up to word length)
for partitions_count in range(1, len(word)+1):
    # this loop generates all possible combinations based on count
    for partitions in itertools.combinations(range(1, len(word)), r=partitions_count):

        # because of the way python splits words, we only care about the
        # difference *between* partitions, and not their distance from the
        # word's beginning
        diffs = list(partitions)
        for i in xrange(len(partitions)-1):
            diffs[i+1] -= partitions[i]

        # first, the whole word is up for taking by partitions
        splits = [word]

        # partition the word's remainder (what was not already "taken")
        # with each partition
        for p in diffs:
            remainder = splits.pop()
            splits.append(remainder[:p])
            splits.append(remainder[p:])

        # print the result
        print splits
like image 35
Reut Sharabani Avatar answered Nov 14 '22 21:11

Reut Sharabani