 

Effective 1-5 grams extraction with python

I have huge files of 3,000,000 lines, and each line has 20-40 words. I have to extract 1- to 5-grams from the corpus. My input files are tokenized plain text, e.g.:

This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .

Currently, I am doing it as below, but this doesn't seem to be an efficient way to extract the 1-5 grams:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import io, os
from collections import Counter
import sys; reload(sys); sys.setdefaultencoding('utf-8')  # Python 2 hack to default to UTF-8

with io.open('train-1.tok.en', 'r', encoding='utf8') as srcfin, \
     io.open('train-1.tok.jp', 'r', encoding='utf8') as trgfin:
    # Extract words from file. 
    src_words = ['<s>'] + srcfin.read().replace('\n', ' </s> <s> ').split()
    del src_words[-1] # Removes the final '<s>'
    trg_words = ['<s>'] + trgfin.read().replace('\n', ' </s> <s> ').split()
    del trg_words[-1] # Removes the final '<s>'

    # Unigrams count.
    src_unigrams = Counter(src_words) 
    trg_unigrams = Counter(trg_words) 
    # Sum of unigram counts.
    src_sum_unigrams = sum(src_unigrams.values())
    trg_sum_unigrams = sum(trg_unigrams.values())

    # Bigrams count.
    src_bigrams = Counter(zip(src_words,src_words[1:]))
    trg_bigrams = Counter(zip(trg_words,trg_words[1:]))
    # Sum of bigram counts.
    src_sum_bigrams = sum(src_bigrams.values())
    trg_sum_bigrams = sum(trg_bigrams.values())

    # Trigrams count.
    src_trigrams = Counter(zip(src_words,src_words[1:], src_words[2:]))
    trg_trigrams = Counter(zip(trg_words,trg_words[1:], trg_words[2:]))
    # Sum of trigram counts.
    src_sum_trigrams = sum(src_trigrams.values())
    trg_sum_trigrams = sum(trg_trigrams.values())

Is there any other way to do this more efficiently?

How can I optimally extract n-grams for several different values of n simultaneously?

From Fast/Optimize N-gram implementations in python, essentially this:

zip(*[words[i:] for i in range(n)])

which, hard-coded for bigrams (n=2), is:

zip(src_words,src_words[1:])

and for trigrams (n=3):

zip(src_words,src_words[1:],src_words[2:])
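
Put together, a rough sketch of what I have in mind for all 1-5 grams (note that unigrams come out as 1-tuples here rather than bare tokens):

from collections import Counter

def ngram_counts(words, max_n=5):
    # Count all 1..max_n grams of a token list; unigrams appear as 1-tuples.
    counts = {}
    for n in range(1, max_n + 1):
        counts[n] = Counter(zip(*[words[i:] for i in range(n)]))
    return counts

# e.g. counts = ngram_counts(src_words); counts[3] is the trigram Counter
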
asked Oct 13 '14 by alvas


2 Answers

If you are interested only in the most common (frequent) n-grams (which I suppose is your case), you can reuse the central idea of the Apriori algorithm. Given s_min, a minimal support which can be thought of as the number of lines that a given n-gram is contained in, it efficiently searches for all such n-grams.

The idea is as follows: write a query function which takes an n-gram and tests how many times it is contained in the corpus. After you have such a function prepared (it may be optimized as discussed later), scan the whole corpus, get all the 1-grams, i.e. bare tokens, and select those which are contained at least s_min times. This gives you the subset F1 of frequent 1-grams. Then test all the possible 2-grams formed by combining the 1-grams from F1. Again, select those which meet the s_min criterion and you'll get F2. By combining all the 2-grams from F2 and selecting the frequent 3-grams, you'll get F3. Repeat for as long as Fn is non-empty.

Many optimizations can be done here. When combining n-grams from Fn, you can exploit the fact that n-grams x and y may only be combined to form an (n+1)-gram iff x[1:] == y[:-1] (this may be checked in constant time for any n if proper hashing is used). Moreover, if you have enough RAM (for your corpus, many GBs), you can speed up the query function dramatically. For each 1-gram, store a hash-set of the line indices containing that 1-gram. When combining two n-grams into an (n+1)-gram, take the intersection of the two corresponding sets, obtaining a set of lines where the (n+1)-gram may be contained.

The time complexity grows as s_min decreases. The beauty is that infrequent (and hence uninteresting) n-grams are completely filtered as the algorithm runs, saving computational time for the frequent ones only.
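
A minimal sketch of this idea (the function name frequent_ngrams is just for illustration, and the candidate check is a plain scan of each shared line, nothing optimized):

from collections import defaultdict

def frequent_ngrams(lines, s_min, max_n=5):
    # Apriori-style mining: keep only n-grams contained in >= s_min lines.
    tokenized = [line.split() for line in lines]

    # F1: map each frequent 1-gram to the set of line indices containing it.
    occurrences = defaultdict(set)
    for idx, tokens in enumerate(tokenized):
        for tok in tokens:
            occurrences[(tok,)].add(idx)
    frequent = {ng: idxs for ng, idxs in occurrences.items() if len(idxs) >= s_min}
    result = dict(frequent)

    n = 1
    while frequent and n < max_n:
        candidates = defaultdict(set)
        for x, x_lines in frequent.items():
            for y, y_lines in frequent.items():
                # x and y can only form an (n+1)-gram if x[1:] == y[:-1].
                if x[1:] != y[:-1]:
                    continue
                new = x + (y[-1],)
                # Only lines containing both x and y can contain the new n-gram.
                for idx in x_lines & y_lines:
                    toks = tokenized[idx]
                    if any(tuple(toks[i:i + n + 1]) == new
                           for i in range(len(toks) - n)):
                        candidates[new].add(idx)
        frequent = {ng: idxs for ng, idxs in candidates.items() if len(idxs) >= s_min}
        result.update(frequent)
        n += 1
    return result

With s_min set reasonably high, only n-grams occurring in at least that many lines survive each round, so the candidate sets stay small.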

answered Sep 25 '22 by Tregoreg


I am giving you a bunch of pointers regarding the general problems you are trying to solve. One or more of these should be useful for you and help you figure this out.

For what you are doing (I am guessing some sort of machine translation experiment), you don't really need to load the two files srcfin and trgfin into memory at the same time (at least not for the code sample you have provided). Processing them separately will be less expensive in terms of the amount of data you need to hold in memory at a given time.

You are reading a ton of data into memory, processing it (which takes even more memory), and then holding the results in in-memory data structures. Instead of doing that, you should strive to be lazier. Learn about Python generators and write a generator which streams out all the ngrams from a given text without needing to hold the entire text in memory at any given point in time. The itertools module will probably come in handy while writing this.
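
For instance, a minimal generator sketch (my own, not a drop-in replacement for the script above; note that, unlike the original snippet, it does not produce n-grams spanning sentence boundaries):

import io
from collections import Counter

def ngrams_from_file(path, n):
    # Lazily yield n-gram tuples one line at a time, with <s>/</s> markers.
    with io.open(path, 'r', encoding='utf8') as fin:
        for line in fin:
            tokens = ['<s>'] + line.split() + ['</s>']
            for i in range(len(tokens) - n + 1):
                yield tuple(tokens[i:i + n])

# Counts the whole file without ever holding the full text in memory.
trigram_counts = Counter(ngrams_from_file('train-1.tok.en', 3))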

Beyond a point, it will no longer be feasible for you to hold all this data in memory. You should consider looking at map-reduce to help you break this down. Check out the mrjob python package, which lets you write map-reduce jobs in python. In the mapper step, you will break the text down into its ngrams, and in the reducer stage you will count the number of times you see each ngram to get its overall count. mrjob jobs can also be run locally, which obviously won't give you any parallelization benefits, but is nice because mrjob will still do a lot of the heavy lifting for you.
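
A sketch of what such a job might look like (the class name MRNgramCount is made up; the mapper/reducer pattern follows mrjob's standard word-count example):

from mrjob.job import MRJob

class MRNgramCount(MRJob):
    # Each input line is one tokenized sentence.

    def mapper(self, _, line):
        tokens = ['<s>'] + line.split() + ['</s>']
        for n in range(1, 6):  # emit 1-grams through 5-grams
            for i in range(len(tokens) - n + 1):
                yield ' '.join(tokens[i:i + n]), 1

    def reducer(self, ngram, counts):
        yield ngram, sum(counts)

if __name__ == '__main__':
    MRNgramCount.run()

You can run it locally on a single file first, and point it at a Hadoop cluster later once the corpus outgrows one machine.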

If you are compelled to hold all the counts in memory at the same time (for a massive amount of text), then either implement some pruning strategy to prune out very rare ngrams, or consider using a persistent file-based lookup table such as sqlite to hold all the data for you.
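
For example, a rough sqlite3-based sketch (the table name ngrams and the batching scheme are my own choices) that periodically flushes an in-memory Counter to disk:

import sqlite3

conn = sqlite3.connect('ngrams.db')
conn.execute('CREATE TABLE IF NOT EXISTS ngrams (ngram TEXT PRIMARY KEY, count INTEGER)')

def flush_counts(conn, counter):
    # Merge an in-memory Counter into the on-disk table, then clear it.
    with conn:
        for ngram, c in counter.items():
            conn.execute('INSERT OR IGNORE INTO ngrams VALUES (?, 0)', (ngram,))
            conn.execute('UPDATE ngrams SET count = count + ? WHERE ngram = ?', (c, ngram))
    counter.clear()

Build a Counter over a chunk of lines at a time, call flush_counts, and memory stays bounded while the totals accumulate in the database.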

answered Sep 26 '22 by Aditya Mukherji