I am trying to run gensim WMD similarity faster. Typically, this is what is in the docs: Example corpus:
my_corpus = ["Human machine interface for lab abc computer applications",
>>> "A survey of user opinion of computer system response time",
>>> "The EPS user interface management system",
>>> "System and human system engineering testing of EPS",
>>> "Relation of user perceived response time to error measurement",
>>> "The generation of random binary unordered trees",
>>> "The intersection graph of paths in trees",
>>> "Graph minors IV Widths of trees and well quasi ordering",
>>> "Graph minors A survey"]
my_query = 'Human and artificial intelligence software programs'
my_tokenized_query =['human','artificial','intelligence','software','programs']
model = a trained word2Vec model on about 100,000 documents similar to my_corpus.
model = Word2Vec.load(word2vec_model)
from gensim import Word2Vec
from gensim.similarities import WmdSimilarity
def init_instance(my_corpus,model,num_best):
instance = WmdSimilarity(my_corpus, model,num_best = 1)
return instance
instance[my_tokenized_query]
the best matched document is "Human machine interface for lab abc computer applications"
which is great.
However the function instance
above takes an extremely long time. So I thought of breaking up the corpus into N
parts and then doing WMD
on each with num_best = 1
, then at the end of it, the part with the max score will be the most similar.
from multiprocessing import Process, Queue ,Manager
def main( my_query,global_jobs,process_tmp):
process_query = gensim.utils.simple_preprocess(my_query)
def worker(num,process_query,return_dict):
instance=init_instance\
(my_corpus[num*chunk+1:num*chunk+chunk], model,1)
x = instance[process_query][0][0]
y = instance[process_query][0][1]
return_dict[x] = y
manager = Manager()
return_dict = manager.dict()
for num in range(num_workers):
process_tmp = Process(target=worker, args=(num,process_query,return_dict))
global_jobs.append(process_tmp)
process_tmp.start()
for proc in global_jobs:
proc.join()
return_dict = dict(return_dict)
ind = max(return_dict.iteritems(), key=operator.itemgetter(1))[0]
print corpus[ind]
>>> "Graph minors A survey"
The problem I have with this is that, even though it outputs something, it doesn't give me a good similar query from my corpus even though it gets the max similarity of all the parts.
Am I doing something wrong?
Comment: chunk is a static variable: e.g. chunk = 600 ...
If you define chunk
static, then you have to compute num_workers
.
10001 / 600 = 16,6683333333 = 17 num_workers
It's common to use not more process
than cores
you have.
If you have 17 cores
, that's ok.
cores
are static, therefore you should:
num_workers = os.cpu_count()
chunk = chunksize(my_corpus, num_workers)
Not the same result, changed to:
#process_query = gensim.utils.simple_preprocess(my_query)
process_query = my_tokenized_query
All worker
results Index 0..n.
Therefore, return_dict[x]
could be overwritten from last worker with same Index having lower value. The Index in return_dict is NOT the same as Index in my_corpus
. Changed to:
#return_dict[x] = y
return_dict[ (num * chunk)+x ] = y
Using +1
in chunk size computing, will skip that first Document.
I don't know how you compute chunk
, consider this example:
def chunksize(iterable, num_workers):
c_size, extra = divmod(len(iterable), num_workers)
if extra:
c_size += 1
if len(iterable) == 0:
c_size = 0
return c_size
#Usage
chunk = chunksize(my_corpus, num_workers)
...
#my_corpus_chunk = my_corpus[num*chunk+1:num*chunk+chunk]
my_corpus_chunk = my_corpus[num * chunk:(num+1) * chunk]
Results: 10 cycle, Tuple=(Index worker num=0, Index worker num=1)
With
multiprocessing
, withchunk=5
:
02,09:(3, 8), 01,03:(3, 5):
System and human system engineering testing of EPS
04,06,07:(0, 8), 05,08:(0, 5), 10:(0, 7):
Human machine interface for lab abc computer applicationsWithout
multiprocessing
, withchunk=5
:
01:(3, 6), 02:(3, 5), 05,08,10:(3, 7), 07,09:(3, 8):
System and human system engineering testing of EPS
03,04,06:(0, 5):
Human machine interface for lab abc computer applicationsWithout
multiprocessing
, without chunking:
01,02,03,04,06,07,08:(3, -1):
System and human system engineering testing of EPS
05,09,10:(0, -1):
Human machine interface for lab abc computer applications
Tested with Python: 3.4.2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With