Say I have a string of words: 'a b c d e f'
. I want to generate a list of multi-word terms from this string.
Word order matters. The term 'f e d'
shouldn't be generated from the above example.
Edit: Also, words should not be skipped. 'a c'
, or 'b d f'
shouldn't be generated.
What I have right now:
doc = 'a b c d e f'
terms= []
one_before = None
two_before = None
for word in doc.split(None):
terms.append(word)
if one_before:
terms.append(' '.join([one_before, word]))
if two_before:
terms.append(' '.join([two_before, one_before, word]))
two_before = one_before
one_before = word
for term in terms:
print term
Prints:
a
b
a b
c
b c
a b c
d
c d
b c d
e
d e
c d e
f
e f
d e f
How would I make this a recursive function so that I can pass it a variable maximum number of words per term?
Application:
I'll be using this to generate multi-word terms from readable text in HTML documents. The overall goal is a latent semantic analysis of a large corpus (about two million documents). This is why keeping word order matters (Natural Language Processing and whatnot).
A recursive implementation may have more than one base case, or more than one recursive step. For example, the Fibonacci function has two base cases, n=0 and n=1.
Writing a recursive function is almost the same as reading one: Create a regular function with a base case that can be reached with its parameters. Pass arguments into the function that immediately trigger the base case. Pass the next arguments that trigger the recursive call just once.
A classic example of recursionThe factorial of a number is computed as that number times all of the numbers below it up to and including 1. For example, factorial(5) is the same as 5*4*3*2*1 , and factorial(3) is 3*2*1 .
Just like regular functions, generators may be recursive, which means they can call themselves inside their function body.
This isn't recursive, but I think it does what you want.
doc = 'a b c d e f'
words = doc.split(None)
max = 3
for index in xrange(len(words)):
for n in xrange(max):
if index + n < len(words):
print ' '.join(words[index:index+n+1])
And here's a recursive solution:
def find_terms(words, max_words_per_term):
if len(words) == 0: return []
return [" ".join(words[:i+1]) for i in xrange(min(len(words), max_words_per_term))] + find_terms(words[1:], max_words_per_term)
doc = 'a b c d e f'
words = doc.split(None)
for term in find_terms(words, 3):
print term
Here's the recursive function again, with some explaining variables and comments.
def find_terms(words, max_words_per_term):
# If there are no words, you've reached the end. Stop.
if len(words) == 0:
return []
# What's the max term length you could generate from the remaining
# words? It's the lesser of max_words_per_term and how many words
# you have left.
max_term_len = min(len(words), max_words_per_term)
# Find all the terms that start with the first word.
initial_terms = [" ".join(words[:i+1]) for i in xrange(max_term_len)]
# Here's the recursion. Find all of the terms in the list
# of all but the first word.
other_terms = find_terms(words[1:], max_words_per_term)
# Now put the two lists of terms together to get the answer.
return initial_terms + other_terms
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With