I am looking at the DeepDist (link) module and thinking of combining it with Gensim's Doc2Vec API to train paragraph vectors on PySpark. The linked page actually provides the following clean example of how to do it for Gensim's Word2Vec model:
from deepdist import DeepDist
from gensim.models.word2vec import Word2Vec
from pyspark import SparkContext

sc = SparkContext()
corpus = sc.textFile('enwiki').map(lambda s: s.split())

def gradient(model, sentences):  # executes on workers
    syn0, syn1 = model.syn0.copy(), model.syn1.copy()  # previous weights
    model.train(sentences)
    return {'syn0': model.syn0 - syn0, 'syn1': model.syn1 - syn1}

def descent(model, update):  # executes on master
    model.syn0 += update['syn0']
    model.syn1 += update['syn1']

with DeepDist(Word2Vec(corpus.collect())) as dd:
    dd.train(corpus, gradient, descent)
    print(dd.model.most_similar(positive=['woman', 'king'], negative=['man']))
To my understanding, DeepDist distributes the work of gradient descent to the workers in batches, then recombines the updates and applies them on the master. If I replace Word2Vec with Doc2Vec, the document vectors should be trained alongside the word vectors.
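To make that update pattern concrete, here is a toy sketch of the idea in plain NumPy (not DeepDist itself; the function and variable names are mine): each worker trains on its shard and ships back only the weight deltas, which the master accumulates into the shared model.

import numpy as np

master = np.zeros(4)                     # stands in for model.syn0 held on the master

def gradient_on_worker(weights, shard):  # like gradient() above, runs per shard
    before = weights.copy()
    weights = weights + 0.1 * shard      # stands in for model.train(sentences)
    return weights - before              # ship back only the delta

def descent_on_master(model, update):    # like descent() above
    model += update
    return model

for shard in (np.ones(4), 2 * np.ones(4)):
    delta = gradient_on_worker(master.copy(), shard)
    master = descent_on_master(master, delta)

print(master)  # [0.3 0.3 0.3 0.3]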
So I looked into the source code of gensim.models.doc2vec (link). The Doc2Vec model instance has the following fields:
model.syn0
model.syn0_lockf
model.docvecs.doctag_syn0
model.docvecs.doctag_syn0_lockf
Compared with the source code of gensim.models.word2vec (link), the following fields are missing from the Doc2Vec model:
model.syn1
model.syn1neg
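Rather than assuming, it may be worth checking which arrays a given gensim build actually allocates; as far as I can tell, whether syn1 or syn1neg exists depends on the hierarchical-softmax/negative-sampling settings. A minimal sketch against the pre-4.0 gensim API discussed here (the toy document and tag are made up; TaggedDocument is the newer name for LabeledSentence):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=['hello', 'world'], tags=['DOC_0000000001'])]
model = Doc2Vec(docs, min_count=1)  # tiny throwaway model, only to inspect its arrays

for name in ('syn0', 'syn0_lockf', 'syn1', 'syn1neg'):
    arr = getattr(model, name, None)
    print(name, None if arr is None else arr.shape)
print('doctag_syn0', model.docvecs.doctag_syn0.shape)
print('doctag_syn0_lockf', model.docvecs.doctag_syn0_lockf.shape)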
I think I should not touch the lockf vectors, because they seem to be used after training is done, when new data points come in. Therefore my code should be something like:
from deepdist import DeepDist
from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from pyspark import SparkContext

sc = SparkContext()

# assume my dataset is in the format: 10-char id followed by doc content,
# 1 line per doc
corpus = sc.textFile('data_set').map(
    lambda s: LabeledSentence(words=s[10:].split(), labels=[s[:10]])
)

def gradient(model, sentence):  # executes on workers
    syn0, doctag_syn0 = model.syn0.copy(), model.docvecs.doctag_syn0.copy()  # previous weights
    model.train(sentence)
    return {'syn0': model.syn0 - syn0, 'doctag_syn0': model.docvecs.doctag_syn0 - doctag_syn0}

def descent(model, update):  # executes on master
    model.syn0 += update['syn0']
    model.docvecs.doctag_syn0 += update['doctag_syn0']

with DeepDist(Doc2Vec(corpus.collect())) as dd:
    dd.train(corpus, gradient, descent)
    print(dd.model.most_similar(positive=['woman', 'king'], negative=['man']))
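Before wiring this into Spark, a quick local check of the parsing logic on one made-up line (the 0-padded 10-digit id is just my assumed format) might look like:

from gensim.models.doc2vec import LabeledSentence

line = '0000000042the quick brown fox'  # made-up record in the assumed format
print(line[:10])                        # 0000000042 -> the document label
print(line[10:].split())                # ['the', 'quick', 'brown', 'fox'] -> the words
doc = LabeledSentence(words=line[10:].split(), labels=[line[:10]])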
Am I missing anything important here? For example:

- Do I need to touch model.syn1 at all? What do those fields mean, after all?
- Am I right that model.*_lockf are the locked matrices used after training?
- Am I OK to use lambda s: LabeledSentence(words=s[10:].split(), labels=[s[:10]]) to parse my dataset, assuming I have each document on one line, prefixed by a 0-padded 10-digit id?

Any suggestion/contribution is very much appreciated. I will write up a blog post to summarize the result, mentioning contributors here, potentially helping others train Doc2Vec models on scaled distributed systems without spending much dev time trying to solve what I am solving now.
Thanks
Update 06/13/2018

My apologies, as I did not get to implement this. But there are better options nowadays, and DeepDist hasn't been maintained for a while now. Please read the comment below.

If you insist on trying out my idea at the moment, be reminded that you are proceeding at your own risk. Also, if someone knows that DeepDist still works, please report back in the comments. It would help other readers.
To avoid this question remaining shown as open, here is how the asker resolved the situation:

I did not get to implement this, and by the time I revisited it I no longer thought it would work. DeepDist uses a Flask app in the backend to interact with Spark's web interface. Since it is no longer maintained, Spark updates have very likely broken it already. If you are looking for Doc2Vec training on Spark, just go for Deeplearning4j (deeplearning4j.org/doc2vec#).