I'm trying to print phrases in a given text. I want to be able to print every phrase in the text, from 2 words up to the maximum number of words the length of the text will allow. I have written a program below that prints all phrases up to 5 words in length, but I can't work out a more elegant way to get it to print all possible phrases.
My definition of a phrase: consecutive words in a string, regardless of meaning.
def phrase_builder(i):
    phrase_length = 4
    phrase_list = []
    for x in range(0, len(i)-phrase_length):
        phrase_list.append(str(i[x]) + " " + str(i[x+1]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]))
        phrase_list.append(str(i[x]) + " " + str(i[x+1]) + " " + str(i[x+2]) + " " + str(i[x+3]) + " " + str(i[x+4]))
    return phrase_list

text = "the big fat cat sits on the mat eating a rat"
print(phrase_builder(text.split()))
The output for this is:
['the big', 'the big fat', 'the big fat cat', 'the big fat cat sits',
'big fat', 'big fat cat', 'big fat cat sits', 'big fat cat sits on',
'fat cat', 'fat cat sits', 'fat cat sits on', 'fat cat sits on the',
'cat sits', 'cat sits on', 'cat sits on the', 'cat sits on the mat',
'sits on', 'sits on the', 'sits on the mat', 'sits on the mat eating',
'on the', 'on the mat', 'on the mat eating', 'on the mat eating a',
'the mat', 'the mat eating', 'the mat eating a', 'the mat eating a rat']
I want to be able to print phrases such as "the big fat cat sits on the mat eating" and "fat cat sits on the mat eating a rat", etc.
Can anyone offer some advice please?
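Simply use itertools.combinations. Each pair of indices (start, end) with start < end picks out the first and last word of one phrase: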
from itertools import combinations

text = "the big fat cat sits on the mat eating a rat"
lst = text.split()
for start, end in combinations(range(len(lst)), 2):
    print(lst[start:end+1])
Output:
['the', 'big']
['the', 'big', 'fat']
['the', 'big', 'fat', 'cat']
['the', 'big', 'fat', 'cat', 'sits']
['the', 'big', 'fat', 'cat', 'sits', 'on']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['the', 'big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['big', 'fat']
['big', 'fat', 'cat']
['big', 'fat', 'cat', 'sits']
['big', 'fat', 'cat', 'sits', 'on']
['big', 'fat', 'cat', 'sits', 'on', 'the']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['big', 'fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['fat', 'cat']
['fat', 'cat', 'sits']
['fat', 'cat', 'sits', 'on']
['fat', 'cat', 'sits', 'on', 'the']
['fat', 'cat', 'sits', 'on', 'the', 'mat']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['fat', 'cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['cat', 'sits']
['cat', 'sits', 'on']
['cat', 'sits', 'on', 'the']
['cat', 'sits', 'on', 'the', 'mat']
['cat', 'sits', 'on', 'the', 'mat', 'eating']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a']
['cat', 'sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['sits', 'on']
['sits', 'on', 'the']
['sits', 'on', 'the', 'mat']
['sits', 'on', 'the', 'mat', 'eating']
['sits', 'on', 'the', 'mat', 'eating', 'a']
['sits', 'on', 'the', 'mat', 'eating', 'a', 'rat']
['on', 'the']
['on', 'the', 'mat']
['on', 'the', 'mat', 'eating']
['on', 'the', 'mat', 'eating', 'a']
['on', 'the', 'mat', 'eating', 'a', 'rat']
['the', 'mat']
['the', 'mat', 'eating']
['the', 'mat', 'eating', 'a']
['the', 'mat', 'eating', 'a', 'rat']
['mat', 'eating']
['mat', 'eating', 'a']
['mat', 'eating', 'a', 'rat']
['eating', 'a']
['eating', 'a', 'rat']
['a', 'rat']
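If strings rather than lists of words are wanted, as in the question, each slice can be joined back together. Here is a minimal sketch of that, written for Python 3; the helper name all_phrases is made up for illustration and is not part of the answer above:

```python
from itertools import combinations


def all_phrases(text):
    """Return every run of two or more consecutive words in text."""
    words = text.split()
    # Each index pair (start, end) with start < end marks the first and
    # last word of one phrase; end + 1 makes the slice inclusive.
    return [" ".join(words[start:end + 1])
            for start, end in combinations(range(len(words)), 2)]


phrases = all_phrases("the big fat cat sits on the mat eating a rat")
print(phrases[0])   # first phrase: words 0 and 1
print(phrases[-1])  # last phrase: the final two words
```

The phrases come out in the same order as the lists printed above, so the first is "the big" and the last is "a rat".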
First, you need to figure out how to write all four of those lines the same way. Instead of concatenating the words and spaces manually, use the join method:
phrase_list.append(" ".join(str(i[x+y]) for y in range(2)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(3)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(4)))
phrase_list.append(" ".join(str(i[x+y]) for y in range(5)))
If the comprehension inside the join method isn't clear, here's how to write it manually:
phrase = []
for y in range(2):
    phrase.append(str(i[x+y]))
phrase_list.append(" ".join(phrase))
Once you've done that, it's trivial to replace those four lines with a loop:
# lengths 2 through phrase_length + 1, i.e. the 2- to 5-word phrases above
for length in range(2, phrase_length + 2):
    phrase_list.append(" ".join(str(i[x+y]) for y in range(length)))
You can simplify this in a couple of other ways independently.
First, i[x+y] for y in range(length) can be done much more easily with a slice: i[x:x+length].
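To see that the slice really is a drop-in replacement, the three spellings can be compared on the sample sentence. This is only an illustrative check in Python 3; it uses the name words for the list called i above, and the values chosen for x and length are arbitrary:

```python
words = "the big fat cat sits on the mat eating a rat".split()
x, length = 2, 3  # arbitrary start index and phrase length for the demo

# Manual concatenation, as in the question.
manual = words[x] + " " + words[x + 1] + " " + words[x + 2]

# join over a generator expression that indexes one word at a time.
indexed = " ".join(words[x + y] for y in range(length))

# join over a slice of the list.
sliced = " ".join(words[x:x + length])

print(manual == indexed == sliced)  # True
print(sliced)                       # fat cat sits
```

One difference worth knowing: indexing past the end of the list raises IndexError, whereas a slice that runs past the end is silently truncated.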
And I'm guessing i is already a list of strings, so you can get rid of the str calls.
Also, range defaults to starting at 0, so you can leave that off.
While we're at it, it would be a lot easier to think about your code if you used meaningful variable names, like words instead of i.
So:
def phrase_builder(words):
    phrase_length = 4
    phrase_list = []
    for i in range(len(words) - phrase_length):
        # lengths 2 through phrase_length + 1, matching the original four lines
        for length in range(2, phrase_length + 2):
            phrase_list.append(" ".join(words[i:i+length]))
    return phrase_list
And now your loops are simple enough that you can turn them into a comprehension and the whole thing is a one-liner:
def phrase_builder(words):
    phrase_length = 4
    return [" ".join(words[i:i+length])
            for i in range(len(words) - phrase_length)
            for length in range(2, phrase_length + 2)]
One last thing: As @SoundDefense asked, are you sure you don't want "eating a rat"? It starts less than 5 words from the end, but it's a 3-word phrase in the text.
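If you do want that, just remove the - phrase_length part, and cap each phrase length at the number of words remaining. A slice that runs past the end of the list is silently truncated, so the last few start positions would otherwise yield shortened or repeated phrases, including the single word "rat".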
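A sketch of that variant, in Python 3: the start index runs over the whole list, and each length is capped at the number of words remaining, because a slice that runs past the end is silently truncated and would otherwise repeat phrases. The max_length parameter is a hypothetical addition for illustration, not part of the code above; leaving it out allows phrases as long as the whole text.

```python
def phrase_builder(words, max_length=None):
    """Return all phrases of 2..max_length consecutive words."""
    if max_length is None:
        max_length = len(words)  # allow phrases as long as the whole text
    return [" ".join(words[i:i + length])
            for i in range(len(words) - 1)
            # Cap the length so the slice never runs past the end.
            for length in range(2, min(max_length, len(words) - i) + 1)]


words = "the big fat cat sits on the mat eating a rat".split()
print(phrase_builder(words, max_length=5)[-3:])
# ['eating a', 'eating a rat', 'a rat']
print(len(phrase_builder(words)))
```

With max_length=5 this keeps the behaviour of the original program but also includes the short phrases at the end of the text; without it, every phrase of two or more words is returned.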
I think the simplest approach is to iterate over all the possible start and end positions in the list of words and generate the phrases for the respective sub-lists of words:
def phrase_builder(words):
    for start in range(0, len(words)-1):
        for end in range(start+2, len(words)+1):
            yield ' '.join(words[start:end])

text = "the big fat cat sits on the mat eating a rat"
for phrase in phrase_builder(text.split()):
    print(phrase)
Output:
the big
the big fat
...
the big fat cat sits on the mat eating a rat
...
sits on the mat eating a
...
eating a rat
a rat
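As a quick sanity check on any of these approaches: a text of n words contains n - k + 1 phrases of exactly k words, so the number of phrases of two or more words is n(n-1)/2. The sketch below (Python 3) repeats the generator above so that it runs on its own, and compares the count against that formula for the sample sentence:

```python
def phrase_builder(words):
    """Yield every run of two or more consecutive words."""
    for start in range(len(words) - 1):
        for end in range(start + 2, len(words) + 1):
            yield " ".join(words[start:end])


words = "the big fat cat sits on the mat eating a rat".split()
phrases = list(phrase_builder(words))
n = len(words)

# Both numbers should agree: the generated count and n(n-1)/2.
print(len(phrases), n * (n - 1) // 2)
```

The count grows quadratically with the length of the text, which is one reason a generator that yields phrases one at a time can be preferable to building the whole list for long inputs.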