Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculating vowel to word length ratio in a list of words

Tags:

python

Here is the code for my function:

def calcVowelProportion(wordList):
    """
    Calculates the proportion of vowels in each word in wordList.
    """

    VOWELS = 'aeiou'
    ratios = []

    for word in wordList:
        numVowels = 0
        for char in word:
            if char in VOWELS:
                numVowels += 1
        ratios.append(numVowels/float(len(word)))

Right now, I'm working with a list of over 87,000 words and this algorithm is obviously extremely slow.

Is there a better way to do this?

EDIT:

I tested the algorithms @ExP provided with the following class:

    import time

    class vowelProportions(object):
        """
        A series of methods that all calculate the vowel/word length ratio
        in a list of words.
        """

        WORDLIST_FILENAME = "words_short.txt"

        def __init__(self):
            self.wordList = self.buildWordList()
            print "Original: " + str(self.calcMeanTime(10000, self.cvpOriginal, self.wordList))
            print "Generator: " + str(self.calcMeanTime(10000, self.cvpGenerator, self.wordList))
            print "Count: " + str(self.calcMeanTime(10000, self.cvpCount, self.wordList))
            print "Translate: " + str(self.calcMeanTime(10000, self.cvpTranslate, self.wordList))

        def buildWordList(self):
            inFile = open(self.WORDLIST_FILENAME, 'r', 0)
            wordList = []
            for line in inFile:
                wordList.append(line.strip().lower())
            return wordList

        def cvpOriginal(self, wordList):
            """ My original, slow algorithm"""
            VOWELS = 'aeiou'
            ratios = []

            for word in wordList:
                numVowels = 0
                for char in word:
                    if char in VOWELS:
                        numVowels += 1
                ratios.append(numVowels/float(len(word)))

            return ratios

        def cvpGenerator(self, wordList):
            """ Using a generator expression """
            return [sum(char in 'aeiou' for char in word)/float(len(word)) for word in wordList]

        def cvpCount(self, wordList):
            """ Using str.count() """
            return [sum(word.count(char) for char in 'aeiou')/float(len(word)) for word in wordList]

        def cvpTranslate(self, wordList):
            """ Using str.translate() """
            return [len(word.translate(None, 'bcdfghjklmnpqrstxyz'))/float(len(word)) for word in wordList]

        def timeFunc(self, func, *args):
            start = time.clock()
            func(*args)
            return time.clock() - start

        def calcMeanTime(self, numTrials, func, *args):
            times = [self.timeFunc(func, *args) for x in range(numTrials)]
            return sum(times)/len(times)

The output was (for a list of 200 words):

Original: 0.0005613667
Generator: 0.0008402738
Count: 0.0012531976
Translate: 0.0003343548

Surprisingly, Generator and Count were even slower than the original (please let me know if my implementation was incorrect).

I would like to test @John's solution, but don't know anything about trees.

like image 505
paulwithap Avatar asked Apr 22 '13 20:04

paulwithap


2 Answers

Since you're just concerned with the ratio of vowels to letters in each word, you could first replace all of the vowels with a. Now you can try a couple of things that might be faster:

  • You're testing for one letter instead of five at each step. That's bound to be faster.
  • You might be able to sort the whole list and search for the points where you go from vowel (now represented categorically as a) to non-vowel. This is a tree structure. The number of letters in the word is the level of the tree. The number of vowels is the number of left branches.
like image 189
John Avatar answered Oct 21 '22 02:10

John


You should optimize the innermost loop.

I'm pretty sure there are several alternative approaches. Here is what I can come up with right now. I'm not sure how they will compare in speed (with respect to each other and to your solution).

  • Using a generator expression:

    numVowels = sum(x in 'aeiou' for x in word)
    
  • Using str.count():

    numVowels = sum(word.count(x) for x in 'aeiou')
    
  • Using str.translate() (assuming there are no capital letters or special symbols):

    numVowels = len(word.translate(None, 'bcdfghjklmnpqrstxyz'))
    

With all of these, you can even write the whole function in a single line without list.append().

I would be curious to know which turns out to be the fastest.

like image 42
Elmar Peise Avatar answered Oct 21 '22 03:10

Elmar Peise