
Printing all possible phrases (consecutive combinations of words) in a given string

Tags:

python

I'm trying to print phrases in a given text. I want to be able to print every phrase in the text, from 2 words up to the maximum number of words the length of the text will allow for. I have written a program below that prints all phrases up to 5 words in length, but I can't work out a more elegant way to get it to print all possible phrases.

My definition of phrase = Consecutive words in a string, regardless of meaning.

def phrase_builder(i):
    phrase_length = 4
    phrase_list = []
    for x in range(0, len(i)-phrase_length):
        phrase_list.append(str(i[x]) + " " + str(i[x+1]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]) + " " + str(i[x+4]))
    return phrase_list

text = "the big fat cat sits on the mat eating a rat"

print(phrase_builder(text.split()))

The output for this is:

['the big', 'the big fat', 'the big fat cat', 'the big fat cat sits',
'big fat', 'big fat cat', 'big fat cat sits', 'big fat cat sits on',
'fat cat', 'fat cat sits', 'fat cat sits on', 'fat cat sits on the',
'cat sits', 'cat sits on', 'cat sits on the', 'cat sits on the mat',
'sits on', 'sits on the', 'sits on the mat', 'sits on the mat eating',
'on the', 'on the mat', 'on the mat eating', 'on the mat eating a',
'the mat', 'the mat eating', 'the mat eating a', 'the mat eating a rat']

I want to be able to print phrases such as "the big fat cat sits on the mat eating" and "fat cat sits on the mat eating a rat" etc.

Can anyone offer some advice please?

MLadbrook asked Jul 25 '14 21:07



3 Answers

Simply use itertools.combinations:

from itertools import combinations

text = "the big fat cat sits on the mat eating a rat"
lst = text.split()
# Each (start, end) pair with start < end marks one run of consecutive words.
for start, end in combinations(range(len(lst)), 2):
    print(lst[start:end+1])

Output:

['the', 'big']
['the', 'big', 'fat']
['the', 'big', 'fat', 'cat']
['the', 'big', 'fat', 'cat', 'sits']
['the', 'big', 'fat', 'cat', 'sits', 'on']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['big', 'fat']
['big', 'fat', 'cat']
['big', 'fat', 'cat', 'sits']
['big', 'fat', 'cat', 'sits', 'on']
['big', 'fat', 'cat', 'sits', 'on', 'the']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['fat', 'cat']
['fat', 'cat', 'sits']
['fat', 'cat', 'sits', 'on']
['fat', 'cat', 'sits', 'on', 'the']
['fat', 'cat', 'sits', 'on', 'the', 'mat']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['cat', 'sits']
['cat', 'sits', 'on']
['cat', 'sits', 'on', 'the']
['cat', 'sits', 'on', 'the', 'mat']
['cat', 'sits', 'on', 'the', 'mat', 'eating']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['sits', 'on']
['sits', 'on', 'the']
['sits', 'on', 'the', 'mat']
['sits', 'on', 'the', 'mat', 'eating']
['sits', 'on', 'the', 'mat', 'eating', 'a']
['sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['on', 'the']
['on', 'the', 'mat']
['on', 'the', 'mat', 'eating']
['on', 'the', 'mat', 'eating', 'a']
['on', 'the', 'mat', 'eating', 'a', 'rat']
['the', 'mat']
['the', 'mat', 'eating']
['the', 'mat', 'eating', 'a']
['the', 'mat', 'eating', 'a', 'rat']
['mat', 'eating']
['mat', 'eating', 'a']
['mat', 'eating', 'a', 'rat']
['eating', 'a']
['eating', 'a', 'rat']
['a', 'rat']
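
If strings are wanted rather than lists, each slice can be joined. The following is only a minimal sketch in Python 3 syntax; the helper name all_phrases is made up for illustration and is not part of the answer above. Because every index pair with start < end is produced exactly once, a text of n words yields n(n-1)/2 phrases.

```python
from itertools import combinations

def all_phrases(text):
    """Return every phrase of two or more consecutive words in text."""
    words = text.split()
    # combinations() yields each index pair (start, end) with start < end once;
    # the pair marks the first and last word of one phrase.
    return [" ".join(words[start:end + 1])
            for start, end in combinations(range(len(words)), 2)]

phrases = all_phrases("the big fat cat sits on the mat eating a rat")
print(len(phrases))   # 55 phrases for 11 words: 11 * 10 / 2
print(phrases[0])     # the big
print(phrases[-1])    # a rat
```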
Kei Minagawa answered Nov 07 '22 18:11



First, you need to figure out how to write all four of those lines the same way. Instead of concatenating the words and spaces manually, use the join method:

phrase_list.append(" ".join(str(i[x+y]) for y in range(2)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(3)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(4)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(5)))

If the comprehension inside the join method isn't clear, here's how to write it manually:

phrase = []
for y in range(2):
    phrase.append(str(i[x+y]))
phrase_list.append(" ".join(phrase))
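
To see that the two spellings agree, here is a small self-contained check. It is only a sketch: the start position x = 2 and the length 3 are arbitrary values picked for the demo, and i is the word list as in the question.

```python
i = "the big fat cat sits on the mat eating a rat".split()
x = 2  # hypothetical start position, chosen only for this demo

# Generator expression passed straight to join.
short_form = " ".join(str(i[x + y]) for y in range(3))

# The same thing spelled out as an explicit loop.
phrase = []
for y in range(3):
    phrase.append(str(i[x + y]))
long_form = " ".join(phrase)

print(short_form)               # fat cat sits
print(short_form == long_form)  # True
```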

Once you've done that, it's trivial to replace those four lines with a loop:

# Lengths 2 through 5 when phrase_length is 4, matching the four lines above.
for length in range(2, phrase_length + 2):
    phrase_list.append(" ".join(str(i[x+y]) for y in range(length)))

You can simplify this in a couple of other ways independently.

First, i[x+y] for y in range(length) can be done much more easily with a slice: i[x:x+length].

And I'm guessing i is already a list of strings, so you can get rid of the str calls.

Also, range defaults to starting at 0, so you can leave that off.

While we're at it, it would be a lot easier to think about your code if you used meaningful variable names, like words instead of i.
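
As a quick illustration of the slice simplification, the sketch below compares the index-based form with the slice. The values of start and length are arbitrary demo values, not anything from the question.

```python
words = "the big fat cat sits on the mat eating a rat".split()
start, length = 3, 4  # hypothetical values for the demo

by_index = [words[start + y] for y in range(length)]
by_slice = words[start:start + length]
print(by_index == by_slice)   # True
print(" ".join(by_slice))     # cat sits on the

# Unlike indexing, a slice that runs past the end of the list is silently
# shortened instead of raising IndexError.
tail = words[9:9 + length]
print(tail)                   # ['a', 'rat']
```

The last line matters once the loop is allowed to start near the end of the list: the slice never fails, it just returns fewer words than requested.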

So:

def phrase_builder(words):
    phrase_length = 4
    phrase_list = []
    for i in range(len(words) - phrase_length):
        # Phrases of 2 to phrase_length + 1 words, as in the original code.
        for length in range(2, phrase_length + 2):
            phrase_list.append(" ".join(words[i:i+length]))
    return phrase_list

And now your loops are simple enough that you can turn them into a comprehension and the whole thing is a one-liner:

def phrase_builder(words):
    phrase_length = 4
    return [" ".join(words[i:i+length])
            for i in range(len(words) - phrase_length)
            for length in range(2, phrase_length + 2)]

One last thing: As @SoundDefense asked, are you sure you don't want "eating a rat"? It starts less than 5 words from the end, but it's a 3-word phrase in the text.

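If you do want that, remove the - phrase_length part. Be aware that a slice running past the end of the list is silently shortened, so you then also need to skip phrases that come out shorter than the requested length; otherwise you can get duplicates and one-word "phrases".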
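
One way to include the phrases near the end without truncated slices is to cap the length at each start position. This is only a sketch under assumptions: the max_length parameter is hypothetical (it is not in the question's code), and leaving it as None is taken to mean "no upper limit", which is what the question ultimately asks for.

```python
def phrase_builder(words, max_length=None):
    """Return all phrases of 2..max_length consecutive words.

    max_length is a hypothetical parameter; None means no upper limit.
    """
    if max_length is None:
        max_length = len(words)
    phrase_list = []
    for i in range(len(words) - 1):
        # Never ask for more words than remain, so no slice is truncated.
        longest = min(max_length, len(words) - i)
        for length in range(2, longest + 1):
            phrase_list.append(" ".join(words[i:i + length]))
    return phrase_list

words = "the big fat cat sits on the mat eating a rat".split()
print(phrase_builder(words, max_length=5)[-3:])  # ['eating a', 'eating a rat', 'a rat']
print(len(phrase_builder(words)))                # 55
```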

abarnert answered Nov 07 '22 20:11



I think the simplest approach is to iterate over all the possible start and end positions in the list of words and generate the phrases for the respective sub-lists of words:

def phrase_builder(words):
    for start in range(0, len(words)-1):
        for end in range(start+2, len(words)+1):
            yield ' '.join(words[start:end])

text = "the big fat cat sits on the mat eating a rat"
for phrase in phrase_builder(text.split()):
    print(phrase)

Output:

the big
the big fat
...
the big fat cat sits on the mat eating a rat
...
sits on the mat eating a
...
eating a rat
a rat
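
If you would rather have the phrases grouped by length (all 2-word phrases first, then all 3-word phrases, and so on), the two loops can be swapped. This is a hedged variant, not part of the answer above; the name phrases_by_length is invented for the example.

```python
def phrases_by_length(words):
    """Yield phrases shortest first: all 2-word phrases, then 3-word, ..."""
    for size in range(2, len(words) + 1):
        # The last valid start leaves exactly `size` words before the end.
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])

result = list(phrases_by_length("the big fat cat".split()))
print(result)
# ['the big', 'big fat', 'fat cat', 'the big fat', 'big fat cat', 'the big fat cat']
```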
tobias_k answered Nov 07 '22 19:11
