I have two text files both consisting of approximately 700,000 lines.
Second file consists of responses to statements in the first file for corresponding line.
I need to calculate Fisher's Exact Score for each word pair that appears on matching lines.
For example, if nth lines in the files are
how are you
and
fine thanx
then I need to calculate Fisher's score for (how,fine), (how,thanx), (are,fine), (are,thanx), (you,fine), (you,thanx).
In order to calculate Fisher's Exact Score, I used collections module's Counter to count the number of appearances of each word, and their co-appearances throughout the two files, as in
with open("finalsrc.txt") as f1, open("finaltgt.txt") as f2:
for line1, line2 in itertools.izip(f1, f2):
words1 = list(set(list(find_words(line1))))
words2 = list(set(list(find_words(line2))))
counts1.update(words1)
counts2.update(words2)
counts_pair.update(list(set(list(itertools.product(words1, words2)))))
then I calculate the Fisher's exact score for each pair using scipy module by
from scipy import stats
def calculateFisher(s, t):
sa = counts1[s]
ta = counts2[t]
st = counts_pair[s, t]
snt = sa - st
nst = ta - st
nsnt = n - sa - ta + st
oddsratio, pvalue = stats.fisher_exact([[st, snt], [nst, nsnt]])
return pvalue
This works fast and fine for small text files, but since my files contain 700,000 lines each, I think the Counter gets too large to retrieve the values quickly, and this becomes very very slow.
(Assuming 10 words per each sentence, the counts_pair would have (10^2)*700,000=70,000,000 entries.)
It would take tens of days to finish the computation for all word pairs in the files.
What would be the smart workaround for this?
I would greatly appreciate your help.
How exactly are you calling the calculateFisher function? Your counts_pair will not have 70 million entries: a lot of word pairs will be seen more than once, so seventy million is the sum of their counts, not the number of keys. You should be only calculating the exact test for pairs that do co-occur, and the best place to find those is in counts_pair. But that means that you can just iterate over it; and if you do, you never have to look anything up in counts_pair:
for (s, t), count in counts_pair.iteritems():
sa = counts1[s]
ta = counts2[t]
st = count
# Continue with Fisher's exact calculation
I've factored out the calculate_fisher function for clarity; I hope you get the idea. So if dictionary look-ups were what's slowing you down, this will save you a whole lot of them. If not, ... do some profiling and let us know what's really going on.
But note that simply looking up keys in a huge dictionary shouldn't slow things down too much. However, "retrieving values quickly" will be difficult if your program must to swap most of its data to disk. Do you have enough memory in your computer to hold the three counters simultaneously? Does the first loop complete in a reasonable amount of time? So find the bottleneck and you'll know more about what needs fixing.
Edit: From your comment it sounds like you are calculating Fisher's exact score over and over during a subsequent step of text processing. Why do that? Break up your program in two steps: First, calculate all word pair scores as I describe. Write each pair and score out into a file as you calculate it. When that's done, use a separate script to read them back in (now the memory contains nothing else but this one large dictionary of pairs & Fisher's exact scores), and rewrite away. You should do that anyway: If it takes you ten days just to get the scores (and you *still haven't given us any details on what's slow, and why), get started and in ten days you'll have them forever, to use whenever you wish.
I did a quick experiment, and a python process with a list of a million ((word, word), count) tuples takes just 300MB (on OS X, but the data structures should be about the same size on Windows). If you have 10 million distinct word pairs, you can expect it to take about 2.5 GB of RAM. I doubt you'll have even this many word pairs (but check!). So if you've got 4GB of RAM and you're not doing anything wrong that you haven't told us about, you should be all right. Otherwise, YMMV.
I think that your bottleneck is in how you manipulate the data structures other than the counters.
words1 = list(set(list(find_words(line1)))) creates a list from a set from a list from the result of find_words. Each of these operations requires allocating memory to hold all of your objects, and copying. Worse still, if the type returned by find_words does not include a __len__ method, the resulting list will have to grow and be recopied as it iterates.
I'm assuming that all you need is an iterable of unique words in order to update your counters, for which set will be perfectly sufficient.
for line1, line2 in itertools.izip(f1, f2):
words1 = set(find_words(line1)) # words1 now has list of unique words from line1
words2 = set(find_words(line2)) # words2 now has list of unique words from line2
counts1.update(words1) # counts1 increments words from line1 (once per word)
counts2.update(words2) # counts2 increments words from line2 (once per word)
counts_pair.update(itertools.product(words1, words2)
Note that you don't need to change the output of itertools.product that is passed to counts_pair as there are no repeated elements in words1 or words2, so the Cartesian product will not have any repeated elements.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With