I am looking at the DeepDist (link) module and thinking of combining it with Gensim's Doc2Vec API to train paragraph vectors on PySpark. The linked page actually provides the following clean example of how to do it for Gensim's Word2Vec model:
from deepdist import DeepDist
from gensim.models.word2vec import Word2Vec
from pyspark import SparkContext

sc = SparkContext()
corpus = sc.textFile('enwiki').map(lambda s: s.split())

def gradient(model, sentences):  # executes on workers
    syn0, syn1 = model.syn0.copy(), model.syn1.copy()  # previous weights
    model.train(sentences)
    return {'syn0': model.syn0 - syn0, 'syn1': model.syn1 - syn1}

def descent(model, update):  # executes on master
    model.syn0 += update['syn0']
    model.syn1 += update['syn1']

with DeepDist(Word2Vec(corpus.collect())) as dd:
    dd.train(corpus, gradient, descent)
    print(dd.model.most_similar(positive=['woman', 'king'], negative=['man']))
To my understanding, DeepDist distributes the work of gradient descent to the workers in batches, then recombines the updates and applies them on the master. If I replace Word2Vec with Doc2Vec, the document vectors should be trained alongside the word vectors.
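To make that update pattern concrete, here is a toy sketch of the idea in plain NumPy (not DeepDist itself; the function and variable names are mine): each worker trains on its shard and ships back only the weight deltas, which the master accumulates into the shared model.

import numpy as np

master = np.zeros(4)                     # stands in for model.syn0 held on the master

def gradient_on_worker(weights, shard):  # like gradient() above, runs per shard
    before = weights.copy()
    weights = weights + 0.1 * shard      # stands in for model.train(sentences)
    return weights - before              # ship back only the delta

def descent_on_master(model, update):    # like descent() above
    model += update
    return model

for shard in (np.ones(4), 2 * np.ones(4)):
    delta = gradient_on_worker(master.copy(), shard)
    master = descent_on_master(master, delta)

print(master)  # [0.3 0.3 0.3 0.3]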
So I looked into the source code of gensim.models.doc2vec (link). The Doc2Vec model instance has the following fields:
model.syn0
model.syn0_lockf
model.docvecs.doctag_syn0
model.docvecs.doctag_syn0_lockf
Compared with the source code of gensim.models.word2vec (link), the following fields are missing from the Doc2Vec model:
model.syn1
model.syn1neg
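Rather than assuming, it may be worth checking which arrays a given gensim build actually allocates; as far as I can tell, whether syn1 or syn1neg exists depends on the hierarchical-softmax/negative-sampling settings. A minimal sketch against the pre-4.0 gensim API discussed here (the toy document and tag are made up; TaggedDocument is the newer name for LabeledSentence):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=['hello', 'world'], tags=['DOC_0000000001'])]
model = Doc2Vec(docs, min_count=1)  # tiny throwaway model, only to inspect its arrays

for name in ('syn0', 'syn0_lockf', 'syn1', 'syn1neg'):
    arr = getattr(model, name, None)
    print(name, None if arr is None else arr.shape)
print('doctag_syn0', model.docvecs.doctag_syn0.shape)
print('doctag_syn0_lockf', model.docvecs.doctag_syn0_lockf.shape)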
I think I should not touch the lockf vectors, because they seem to be used after training is done, when new data points come in. Therefore my code should be something like:
from deepdist import DeepDist
from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from pyspark import SparkContext

sc = SparkContext()

# assume my dataset is in the format: 10-char id followed by doc content,
# 1 line per doc
corpus = sc.textFile('data_set').map(
    lambda s: LabeledSentence(words=s[10:].split(), labels=[s[:10]])
)

def gradient(model, sentence):  # executes on workers
    syn0, doctag_syn0 = model.syn0.copy(), model.docvecs.doctag_syn0.copy()  # previous weights
    model.train(sentence)
    return {'syn0': model.syn0 - syn0, 'doctag_syn0': model.docvecs.doctag_syn0 - doctag_syn0}

def descent(model, update):  # executes on master
    model.syn0 += update['syn0']
    model.docvecs.doctag_syn0 += update['doctag_syn0']

with DeepDist(Doc2Vec(corpus.collect())) as dd:
    dd.train(corpus, gradient, descent)
    print(dd.model.most_similar(positive=['woman', 'king'], negative=['man']))
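Before wiring this into Spark, a quick local check of the parsing logic on one made-up line (the 0-padded 10-digit id is just my assumed format) might look like:

from gensim.models.doc2vec import LabeledSentence

line = '0000000042the quick brown fox'  # made-up record in the assumed format
print(line[:10])                        # 0000000042 -> the document label
print(line[10:].split())                # ['the', 'quick', 'brown', 'fox'] -> the words
doc = LabeledSentence(words=line[10:].split(), labels=[line[:10]])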
Am I missing anything important here? For example:

- Do I need to touch model.syn1 at all? What do those fields mean, after all?
- Am I right that model.*_lockf are the locked matrices used after training?
- Am I OK to use lambda s: LabeledSentence(words=s[10:].split(), labels=[s[:10]]) to parse my dataset, assuming I have each document on one line, prefixed by a 0-padded 10-digit id?

Any suggestion/contribution is very much appreciated. I will write up a blog post to summarize the result, mentioning contributors here, potentially helping others train Doc2Vec models on scaled distributed systems without spending much dev time trying to solve what I am solving now.
Thanks
Update 06/13/2018

My apologies, as I did not get to implement this. But there are better options nowadays, and DeepDist hasn't been maintained for a while now. Please read the comment below.

If you insist on trying out my idea at the moment, be reminded that you are proceeding at your own risk. Also, if someone knows that DeepDist still works, please report back in the comments. It would help other readers.
To avoid this question remaining shown as open, here is how the asker resolved the situation:

I did not get to implement this, and by the time I revisited it I no longer thought it would work. DeepDist uses a Flask app in the backend to interact with Spark's web interface. Since it is no longer maintained, Spark updates have very likely broken it already. If you are looking for Doc2Vec training on Spark, just go for Deeplearning4j (deeplearning4j.org/doc2vec#).