Given big.txt from norvig.com/big.txt, the goal is to count the bigrams really fast (imagine that I have to repeat this counting 100,000 times).
According to Fast/Optimize N-gram implementations in python, extracting bigrams like this should be optimal:
_bigrams = zip(*[text[i:] for i in range(2)])
And since I'm using Python 3, zip returns a lazy iterator that won't be evaluated until I materialize it with list(_bigrams) or some other function that consumes it.
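A quick illustration of that laziness on a toy string (not the actual big.txt):

text = 'abcde'
_bigrams = zip(*[text[i:] for i in range(2)])
print(_bigrams)        # <zip object at 0x...> -- nothing has been consumed yet
print(list(_bigrams))  # [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]

With that in mind, here's the full loop I'm timing: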
import io
from collections import Counter
import time

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(_bigrams).most_common(100)
    print(time.time() - start)  # time a single count
    # Do some manipulation to text and repeat the counting.
    text = manipulate(text, top100)
But each iteration takes a little over 1 second, and 100,000 iterations would take far too long.
I've also tried sklearn's CountVectorizer, but the time to extract, count, and get the top 100 bigrams is comparable to the native Python version.
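Roughly, the CountVectorizer attempt was along these lines (the analyzer and n-gram settings shown here are my reconstruction of the obvious choice for character bigrams, not verbatim from my script):

from sklearn.feature_extraction.text import CountVectorizer

# character-level bigrams over the whole document
vec = CountVectorizer(analyzer='char', ngram_range=(2, 2))
X = vec.fit_transform([text])               # 1 x n_bigrams sparse count matrix
counts = X.toarray().ravel()
bigram_names = vec.get_feature_names_out()  # get_feature_names() on older sklearn
top100 = sorted(zip(bigram_names, counts), key=lambda t: -t[1])[:100]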
Then I experimented with multiprocessing, using a slight modification of Python multiprocessing and a shared counter and http://eli.thegreenplace.net/2012/01/04/shared-counter-with-pythons-multiprocessing:
from multiprocessing import Process, Manager, Lock
import time

class MultiProcCounter(object):
    def __init__(self):
        self.dictionary = Manager().dict()
        self.lock = Lock()

    def increment(self, item):
        with self.lock:
            self.dictionary[item] = self.dictionary.get(item, 0) + 1

def func(counter, item):
    counter.increment(item)

def multiproc_count(inputs):
    counter = MultiProcCounter()
    procs = [Process(target=func, args=(counter, _in)) for _in in inputs]
    for p in procs: p.start()
    for p in procs: p.join()
    return counter.dictionary

inputs = [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 2, 2, 3, 1, 2]
print(multiproc_count(inputs))
But using the MultiProcCounter in the bigram counting takes even longer than 1 second per iteration. I have no idea why that is the case; with the dummy list of ints example, multiproc_count works perfectly.
I've tried:
import io
from collections import Counter
import time

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(multiproc_count(_bigrams)).most_common(100)
    print(time.time() - start)  # time a single count
Is there any way to count the bigrams really fast in Python?
Multiprocessing is used to significantly speed up a program, especially one with a lot of CPU-intensive tasks. In that case several functions can run at the same time, because each one uses a different CPU core, which in turn improves CPU utilization.
Multiprocessing refers to a system that has two or more central processing units (CPUs). Every additional CPU added to a system increases its speed, power, and memory, and allows users to run multiple processes simultaneously.
If your program is IO-bound, both multithreading and multiprocessing in Python will work smoothly. However, if the code is CPU-bound and your machine has multiple cores, multiprocessing is the better choice.
multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads.
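As a minimal sketch of that idea applied to this problem (the worker count, chunk sizes, and one-character overlap are my own choices, not something prescribed by the docs), you could split the text into overlapping chunks and let a Pool count each chunk in parallel:

from collections import Counter
from multiprocessing import Pool

def count_chunk(chunk):
    # count the bigrams inside one chunk of text
    return Counter(zip(chunk, chunk[1:]))

def parallel_bigram_count(text, workers=4):
    # overlap adjacent chunks by one character so the bigram
    # straddling each boundary is not lost
    step = len(text) // workers
    chunks = [text[i * step : (i + 1) * step + 1] for i in range(workers - 1)]
    chunks.append(text[(workers - 1) * step :])
    with Pool(workers) as pool:
        partial_counts = pool.map(count_chunk, chunks)
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == '__main__':
    sample = 'this is just an example string'
    print(parallel_bigram_count(sample).most_common(5))

The key difference from the MultiProcCounter approach in the question is that each worker processes a whole chunk instead of a single item, so the process-spawn and inter-process communication costs are paid only once per worker rather than once per bigram.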
import os
import threading

text = 'I really like cheese'  # just load whatever you want here, this is just an example
CORE_NUMBER = os.cpu_count() or 4  # cpu_count() can return None; substitute your core count if so

bigrams = []

def extract_bigrams(cores):
    global bigrams
    bigrams = [None] * cores  # one result slot per thread
    cpnt = 0  # current point in the text
    step = len(text) // cores
    threads = []
    for a in range(cores - 1):
        # the +1 overlap is intentional: it catches the bigram that
        # straddles two adjacent chunks
        t = threading.Thread(target=extract_bigrams2, args=(cpnt, cpnt + step + 1, a))
        t.start()
        threads.append(t)
        cpnt += step
    t = threading.Thread(target=extract_bigrams2, args=(cpnt, len(text), cores - 1))
    t.start()
    threads.append(t)
    for t in threads:  # wait until every chunk has been processed
        t.join()

def extract_bigrams2(startpoint, endpoint, threadnum):
    chunk = text[startpoint:endpoint]
    bigrams[threadnum] = list(zip(chunk, chunk[1:]))

extract_bigrams(CORE_NUMBER)

thebigrams = []
for a in bigrams:
    thebigrams += a
print(thebigrams)
There are some issues with this program, like not filtering out whitespace or punctuation, but I wrote it to show what you should be shooting for. You can easily edit it to suit your needs.
The program auto-detects how many cores your computer has and creates that number of threads, attempting to distribute the areas where it looks for bigrams evenly among them. I've only been able to test this code in an online interpreter on a school-owned computer, so I can't be certain it works completely. If there are any problems or questions, please leave them in the comments.
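To connect this back to the original goal (assuming the code above has populated thebigrams), getting the top 100 counts is then a one-liner with collections.Counter:

from collections import Counter

top100 = Counter(thebigrams).most_common(100)
print(top100)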