Here is the code for my function:
def calcVowelProportion(wordList):
"""
Calculates the proportion of vowels in each word in wordList.
"""
VOWELS = 'aeiou'
ratios = []
for word in wordList:
numVowels = 0
for char in word:
if char in VOWELS:
numVowels += 1
ratios.append(numVowels/float(len(word)))
Right now, I'm working with a list of over 87,000 words and this algorithm is obviously extremely slow.
Is there a better way to do this?
EDIT:
I tested the algorithms @ExP provided with the following class:
import time
class vowelProportions(object):
"""
A series of methods that all calculate the vowel/word length ratio
in a list of words.
"""
WORDLIST_FILENAME = "words_short.txt"
def __init__(self):
self.wordList = self.buildWordList()
print "Original: " + str(self.calcMeanTime(10000, self.cvpOriginal, self.wordList))
print "Generator: " + str(self.calcMeanTime(10000, self.cvpGenerator, self.wordList))
print "Count: " + str(self.calcMeanTime(10000, self.cvpCount, self.wordList))
print "Translate: " + str(self.calcMeanTime(10000, self.cvpTranslate, self.wordList))
def buildWordList(self):
inFile = open(self.WORDLIST_FILENAME, 'r', 0)
wordList = []
for line in inFile:
wordList.append(line.strip().lower())
return wordList
def cvpOriginal(self, wordList):
""" My original, slow algorithm"""
VOWELS = 'aeiou'
ratios = []
for word in wordList:
numVowels = 0
for char in word:
if char in VOWELS:
numVowels += 1
ratios.append(numVowels/float(len(word)))
return ratios
def cvpGenerator(self, wordList):
""" Using a generator expression """
return [sum(char in 'aeiou' for char in word)/float(len(word)) for word in wordList]
def cvpCount(self, wordList):
""" Using str.count() """
return [sum(word.count(char) for char in 'aeiou')/float(len(word)) for word in wordList]
def cvpTranslate(self, wordList):
""" Using str.translate() """
return [len(word.translate(None, 'bcdfghjklmnpqrstxyz'))/float(len(word)) for word in wordList]
def timeFunc(self, func, *args):
start = time.clock()
func(*args)
return time.clock() - start
def calcMeanTime(self, numTrials, func, *args):
times = [self.timeFunc(func, *args) for x in range(numTrials)]
return sum(times)/len(times)
The output was (for a list of 200 words):
Original: 0.0005613667
Generator: 0.0008402738
Count: 0.0012531976
Translate: 0.0003343548
Surprisingly, Generator and Count were even slower than the original (please let me know if my implementation was incorrect).
I would like to test @John's solution, but don't know anything about trees.
Since you're just concerned with the ratio of vowels to letters in each word, you could first replace all of the vowels with a
. Now you can try a couple of things that might be faster:
a
) to non-vowel. This is a tree structure. The number of letters in the word is the level of the tree. The number of vowels is the number of left branches.You should optimize the innermost loop.
I'm pretty sure there are several alternative approaches. Here is what I can come up with right now. I'm not sure how they will compare in speed (with respect to each other and to your solution).
Using a generator expression:
numVowels = sum(x in 'aeiou' for x in word)
Using str.count()
:
numVowels = sum(word.count(x) for x in 'aeiou')
Using str.translate()
(assuming there are no capital letters or special symbols):
numVowels = len(word.translate(None, 'bcdfghjklmnpqrstxyz'))
With all of these, you can even write the whole function in a single line without list.append()
.
I would be curious to know which turns out to be the fastest.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With