Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Preceding Word Length

Tags:

python

I have to make a function that takes a single argument word and returns the average length(in characters) of the word that precedes word in the text. If word happens to be the first word occurring in the text, the length of the preceding word for that occurrence should be zero. For example

>>> average_length("the")
4.4
>>> average_length('whale')
False
average_length('ship.')
3.0 

This is what I have written so far,

def average_length(word):
    text = "Call me Ishmael. Some years ago - never mind how long..........."
    words = text.split()
    wordCount = len(words)

    Sum = 0
    for word in words:
        ch = len(word)
        Sum = Sum + ch
    avg = Sum/wordCount
    return avg

I know this isn't right at all but I'm having trouble of how to approach this correctly. This question is asking me to find every instance of word in the text, and when you do, calculate the length of the word immediately before it in the text. Not every word from beginning to that word, just one.

I should have also mentioned that all the tests will only test my code using the first paragraph from 'Moby Dick':

"Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."

like image 893
RoadRunner Avatar asked Mar 21 '16 09:03

RoadRunner


2 Answers

It seems like you can save a lot of computation time by going over your data only once:

from collections import defaultdict
prec = defaultdict(list)
text = "Call me Ishmael. Some years ago..".split()

Create two iterators over your list. We call next on the second one, so that from now on, whenever we get an element from the iterator, we get a word and its successor.

first, second = iter(text), iter(text)
next(second)

Zipping over the two iterators ("abc","def""ad", "be", "cf"), we append the length of the first word to the list of predecessor lengths of the second. That works, because we're using a defaultdict(list), which returns an empty list for any not yet existing key.

for one, two in zip(first, second):  # pairwise
    prec[two].append(len(one))

Finally, we can create a new dictionary from words to the mean of their predecessor's lengths: Sum divided by length. Instead of this dictionary comprehension, you could also use a normal for loop.

# avg_prec_len = {key: sum(prec[key]) / len(prec[key]) for key in prec}
avg_prec_len = {}
for key in prec:
    # prec[key] is a list of lengths
    avg[key] = sum(prec[key]) / len(prec[key])

Then you can just look it up in that dictionary.

(If you're using Python 2, use izip instead of zip, and do from __future__ import division).

like image 61
L3viathan Avatar answered Oct 12 '22 23:10

L3viathan


Based on your requirements for no imports and a simple approach, the following function does it without any, the comments and variable names should make the function logic pretty clear:

def match_previous(lst, word):
    # keep matches_count of how many times we find a match and total lengths
    matches_count = total_length_sum = 0.0
    # pull first element from list to use as preceding word
    previous_word = lst[0]
    # slice rest of words from the list 
    # so we always compare two consecutive words
    rest_of_words = lst[1:]
    # catch where first word is "word" and add 1 to matches_count
    if previous_word == word:
        matches_count += 1
    for current_word in rest_of_words:
        # if the current word matches our "word"
        # add length of previous word to total_length_sum
        # and increase matches_count.
        if word == current_word:
            total_length_sum += len(previous_word)
            matches_count += 1
        # always update to keep track of word just seen
        previous_word = current_word
    # if  matches_count is 0 we found no word in the text that matched "word"
    return total_length_sum / matches_count if matches_count else False

It takes two args, the split list of words and the word to search for:

In [41]: text = "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to previous_wordent me from deliberately stepping into the street, and methodically knocking people's hats off - then, I acmatches_count it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."

In [42]: match_previous(text.split(),"the")
Out[42]: 4.4

In [43]: match_previous(text.split(),"ship.")
Out[43]: 3.0

In [44]: match_previous(text.split(),"whale")
Out[44]: False

In [45]: match_previous(text.split(),"Call")
Out[45]: 0.0

You can obviously do the same as your own function, take a single arg do the split text in the function. The only way False will be returned will be if we found no match for the word, you can see call returns 0.0 as it is the first word in the text.

If we add some prints to the code and use enumerate:

def match_previous(lst, word):
    matches_count = total_length_sum = 0.0
    previous_word = lst[0]
    rest_of_words = lst[1:]
    if previous_word == word:
        print("First word matches.")
        matches_count += 1
    for ind, current_word in enumerate(rest_of_words, 1):
        print("On iteration {}.\nprevious_word = {} and current_word = {}.".format(ind, previous_word, current_word))
        if word == current_word:
            total_length_sum += len(previous_word)
            matches_count += 1
            print("We found a match at index {} in our list of words.".format(ind-1))
        print("Updating previous_word from {} to {}.".format(previous_word, current_word))
        previous_word = current_word
    return total_length_sum / matches_count if matches_count else False

And run it with a small sample list we can see what happens:

In [59]: match_previous(["bar","foo","foobar","hello", "world","bar"],"bar")
First word matches.
On iteration 1.
previous_word = bar and current_word = foo.
Updating previous_word from bar to foo.
On iteration 2.
previous_word = foo and current_word = foobar.
Updating previous_word from foo to foobar.
On iteration 3.
previous_word = foobar and current_word = hello.
Updating previous_word from foobar to hello.
On iteration 4.
previous_word = hello and current_word = world.
Updating previous_word from hello to world.
On iteration 5.
previous_word = world and current_word = bar.
We found a match at index 4 in our list of words.
Updating previous_word from world to bar.
Out[59]: 2.5

The advantage to use iter is we don't need to create a new list by slicing the remainder, to use it in the code you would just need to change the start of the function to:

def match_previous(lst, word):
    matches_count = total_length_sum = 0.0
    # create an iterator
    _iterator = iter(lst)
    # pull first word from iterator
    previous_word = next(_iterator)
    if previous_word == word:
        matches_count += 1
    # _iterator will give us all bar the first word we consumed with  next(_iterator)
    for current_word in _iterator:

Each time you consume an element from an iterator, we move to the next element:

In [61]: l = [1,2,3,4]

In [62]: it = iter(l)

In [63]: next(it)
Out[63]: 1

In [64]: next(it)
Out[64]: 2
# consumed two of four so we are left with two
In [65]: list(it)
Out[65]: [3, 4]

The only way a dict really makes sense is if you take multiple words to your function which you can do with *args:

def sum_previous(text):
    _iterator = iter(text.split())
    previous_word = next(_iterator)
    # set first k/v pairing with the first word
    # if  "total_lengths" is 0 at the end we know there
    # was only one match at the very start
    avg_dict = {previous_word: {"count": 1.0, "total_lengths": 0.0}}
    for current_word in _iterator:
        # if key does not exist, it creates a new key/value pairing
        avg_dict.setdefault(current_word, {"count": 0.0, "total_lengths": 0.0})
        # update value adding word length and increasing the count
        avg_dict[current_word]["total_lengths"] += len(previous_word)
        avg_dict[current_word]["count"] += 1
        previous_word = current_word
    # return the dict so we can use it outside the function.
    return avg_dict


def match_previous_generator(*args):
    # create our dict mapping words to sum of all lengths of their preceding words.
    d = sum_previous(text)
    # for every word we pass to the function.
    for word in args:
        # use dict.get with a default of an empty dict.
        #  to catch when a word is not in out text.
        count = d.get(word, {}).get("count")
        # yield each word and it's avg or False for non existing words.
        yield (word, d[word]["total_lengths"] / count if count else False)

Then just pass in the text and all the words you want to search for, you can call list on the generator function:

In [69]: list(match_previous_generator("the","Call", "whale", "ship."))
Out[69]: [('the', 4.4), ('Call', 0.0), ('whale', False), ('ship.', 3.0)]

Or iterate over it:

In [70]: for tup in match_previous_generator("the","Call", "whale", "ship."):
   ....:     print(tup)
   ....:     
('the', 4.4)
('Call', 0.0)
('whale', False)
('ship.', 3.0)
like image 24
Padraic Cunningham Avatar answered Oct 12 '22 23:10

Padraic Cunningham