 

Counting bigrams real fast (with or without multiprocessing) - python

Given big.txt from norvig.com/big.txt, the goal is to count the bigrams really fast (imagine that I have to repeat this counting 100,000 times).

According to Fast/Optimize N-gram implementations in python, extracting bigrams like this would be the most efficient:

_bigrams = zip(*[text[i:] for i in range(2)])

And in Python 3, zip returns a lazy iterator that isn't evaluated until I materialize it with list(_bigrams) or some other function that consumes it.
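
For illustration, a tiny standalone sketch (mine, not from the question) of that lazy behaviour:

text = "hello"
_bigrams = zip(*[text[i:] for i in range(2)])
print(_bigrams)        # <zip object ...> -- nothing has been computed yet
print(list(_bigrams))  # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]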

import io
import time
from collections import Counter

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(_bigrams).most_common(100)
    # Do some manipulation to text and repeat the counting.
    text = manipulate(text, top100)

But that takes around 1+ seconds per iteration, and 100,000 iterations would take far too long.

I've also tried sklearn's CountVectorizer, but the time to extract, count, and get the top 100 bigrams is comparable to the native Python approach.
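
The question doesn't include that attempt; a hedged sketch of what a character-bigram CountVectorizer version might look like (the parameter choices here are my assumption, not the OP's actual code):

# Sketch of a possible CountVectorizer attempt (not the OP's actual code).
# analyzer='char' with ngram_range=(2, 2) counts character bigrams.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

with open('big.txt', encoding='utf8') as fin:
    text = fin.read().lower()

vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
counts = vectorizer.fit_transform([text])           # 1 x n_bigrams sparse matrix
totals = np.asarray(counts.sum(axis=0)).ravel()     # total count per bigram
vocab = vectorizer.get_feature_names_out()          # older sklearn: get_feature_names()
top100 = sorted(zip(vocab, totals), key=lambda x: -x[1])[:100]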

Then I experimented with some multiprocessing, using a slight modification of Python multiprocessing and a shared counter and http://eli.thegreenplace.net/2012/01/04/shared-counter-with-pythons-multiprocessing:

import time
from multiprocessing import Process, Manager, Lock

class MultiProcCounter(object):
    def __init__(self):
        self.dictionary = Manager().dict()
        self.lock = Lock()

    def increment(self, item):
        with self.lock:
            self.dictionary[item] = self.dictionary.get(item, 0) + 1

def func(counter, item):
    counter.increment(item)

def multiproc_count(inputs):
    counter = MultiProcCounter()
    procs = [Process(target=func, args=(counter,_in)) for _in in inputs]
    for p in procs: p.start()
    for p in procs: p.join()
    return counter.dictionary

inputs = [1,1,1,1,2,2,3,4,4,5,2,2,3,1,2]

print (multiproc_count(inputs))

But using the MultiProcCounter for the bigram counting takes even longer than 1+ seconds per iteration. I have no idea why that is the case; with the dummy list of ints, multiproc_count works perfectly.
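
One detail worth noting from the snippet above: multiproc_count starts one Process and takes the Manager lock for every single input item, so with millions of bigrams the process and lock overhead dominates the counting itself. For contrast, a minimal sketch (my own illustration, not code from the question or the answer) of the more usual pattern, counting per chunk and merging the Counters:

# Illustration only: count bigrams per chunk in worker processes and merge
# the Counters, instead of one Process + lock acquisition per bigram.
import os
from collections import Counter
from multiprocessing import Pool

def count_chunk(chunk):
    # chunk is a slice of the text; count its character bigrams locally
    return Counter(zip(chunk, chunk[1:]))

def parallel_bigram_count(text, workers=os.cpu_count() or 1):
    size = len(text) // workers + 1
    # overlap chunks by one character so no bigram is lost at the boundaries
    chunks = [text[i:i + size + 1] for i in range(0, len(text), size)]
    with Pool(workers) as pool:
        counters = pool.map(count_chunk, chunks)
    total = Counter()
    for c in counters:
        total.update(c)
    return total

if __name__ == '__main__':
    with open('big.txt', encoding='utf8') as fin:
        text = fin.read().lower()
    print(parallel_bigram_count(text).most_common(10))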

I've tried:

import io
import time
from collections import Counter

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(multiproc_count(_bigrams)).most_common(100)

Is there any way to count the bigrams really fast in Python?

asked Nov 02 '16 by alvas



1 Answer

import os
import _thread  # low-level threads; the module is named _thread in Python 3

text = 'I really like cheese' # just load whatever you want here, this is just an example

CORE_NUMBER = os.cpu_count() or 1 # may not be available, just replace with how many cores you have if it crashes

ready = []
bigrams = []

def extract_bigrams(cores):
    global ready, bigrams
    bigrams = [0] * cores  # one result slot per thread
    ready = [0] * cores    # 0 = still working, 1 = finished
    cpnt = 0 # current point
    iterator = len(text) // cores
    for a in range(cores - 1):
        _thread.start_new_thread(extract_bigrams2, (cpnt, cpnt + iterator + 1, a)) # overlap is intentional
        cpnt += iterator
    _thread.start_new_thread(extract_bigrams2, (cpnt, len(text), cores - 1))
    while 0 in ready: # busy-wait until every thread has reported back
        pass

def extract_bigrams2(startpoint, endpoint, threadnum):
    global ready, bigrams
    bigrams[threadnum] = list(zip(*[text[startpoint + i:endpoint] for i in range(2)]))
    ready[threadnum] = 1

extract_bigrams(CORE_NUMBER)
thebigrams = []
for a in bigrams:
    thebigrams += a

print(thebigrams)

There are some issues with this program, such as not filtering out whitespace or punctuation, but I wrote it to show what you should be shooting for. You can easily edit it to suit your needs.

This program auto-detects how many cores your computer has and creates that number of threads, attempting to distribute the areas where it looks for bigrams evenly. I've only been able to test this code in an online browser on a school-owned computer, so I can't be certain it works completely. If there are any problems or questions, please leave them in the comments.
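
To tie the result back to the question's goal (the top 100 bigrams), the merged list can be fed straight into collections.Counter; this small addition is mine, not part of the original answer:

# Hedged addition (not part of the original answer): turn the merged list of
# bigram tuples into the top-100 count the question asks for.
from collections import Counter

top100 = Counter(thebigrams).most_common(100)
print(top100[:5])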

answered Sep 28 '22 by Douglas