
Doc2Vec and PySpark: Gensim Doc2vec over DeepDist

I am looking at the DeepDist (link) module and thinking of combining it with Gensim's Doc2Vec API to train paragraph vectors on PySpark. The link actually provides the following clean example of how to do it for Gensim's Word2Vec model:

from deepdist import DeepDist
from gensim.models.word2vec import Word2Vec
from pyspark import SparkContext

sc = SparkContext()
corpus = sc.textFile('enwiki').map(lambda s: s.split())

def gradient(model, sentences):  # executes on workers
    syn0, syn1 = model.syn0.copy(), model.syn1.copy()   # previous weights
    model.train(sentences)
    return {'syn0': model.syn0 - syn0, 'syn1': model.syn1 - syn1}

def descent(model, update):      # executes on master
    model.syn0 += update['syn0']
    model.syn1 += update['syn1']

with DeepDist(Word2Vec(corpus.collect())) as dd:
    dd.train(corpus, gradient, descent)
    print dd.model.most_similar(positive=['woman', 'king'], negative=['man']) 

To my understanding, DeepDist distributes the work of gradient descent to the workers in batches, then recombines the resulting weight deltas and applies them at the master. If I replace Word2Vec with Doc2Vec, the document vectors should be trained along with the word vectors.
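To make that pattern concrete, here is a toy illustration (not DeepDist code, just NumPy) of the delta-exchange idea: each worker reports the difference between its weights after and before local training, and the master simply adds those deltas to its own copy:

import numpy as np

# each worker's gradient() returns (weights_after - weights_before);
# the master's descent() applies each delta as it arrives
master = np.zeros(3)
worker_deltas = [np.array([0.1, 0.0, -0.2]), np.array([0.0, 0.3, 0.1])]
for delta in worker_deltas:
    master += delta
print master  # -> [ 0.1  0.3 -0.1]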

So I looked into the source code of gensim.models.doc2vec (link). There are the following fields in the Doc2Vec model instance:

  1. model.syn0
  2. model.syn0_lockf
  3. model.docvecs.doctag_syn0
  4. model.docvecs.doctag_syn0_lockf

Comparing with the source code of gensim.models.word2vec (link), the following fields appear to be missing from the Doc2Vec model (see the inspection snippet after this list):

  1. model.syn1
  2. model.syn1neg
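
One thing worth checking before assuming these fields are gone: in gensim, whether a model carries syn1 (hierarchical softmax) or syn1neg (negative sampling) depends on the hs and negative training flags rather than on Word2Vec vs. Doc2Vec. A minimal sketch to inspect which arrays a given model actually allocates, assuming docs is any small iterable of LabeledSentence objects:

from gensim.models.doc2vec import Doc2Vec

# 'docs' is assumed to be a small iterable of LabeledSentence objects
model = Doc2Vec(docs, hs=1, negative=0)  # hierarchical-softmax variant
for name in ('syn0', 'syn1', 'syn1neg'):
    print name, hasattr(model, name)
print 'doctag_syn0', hasattr(model.docvecs, 'doctag_syn0')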

I think I should not touch the lockf vectors, because they seem to be used only after training is done, when new data points come in. Therefore my code should be something like:

from deepdist import DeepDist
from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from pyspark import SparkContext

sc = SparkContext()

# assume my dataset is in format 10-char-id followed by doc content
# 1 line per doc
corpus = sc.textFile('data_set').map(
    lambda s: LabeledSentence(words=s[10:].split(), labels=[s[:10]])
)

def gradient(model, sentence):  # executes on workers
    syn0, doctag_syn0 = model.syn0.copy(), model.docvecs.doctag_syn0.copy()   # previous weights
    model.train(sentence)
    return {'syn0': model.syn0 - syn0, 'doctag_syn0': model.docvecs.doctag_syn0 - doctag_syn0}

def descent(model, update):      # executes on master
    model.syn0 += update['syn0']
    model.docvecs.doctag_syn0 += update['doctag_syn0']

with DeepDist(Doc2Vec(corpus.collect())) as dd:
    dd.train(corpus, gradient, descent)
    print dd.model.most_similar(positive=['woman', 'king'], negative=['man']) 

Am I missing anything important here? For example:

  1. Should I care about model.syn1 at all? What do they mean after all?
  2. Am I right that model.*_lockf is the locked matrices after training?
  3. Is it ok that I use lambda s: LabeledSentence(words=s[10:].split(), labels=[s[:10]]) to parse my dataset, assuming I have each document on one line, prefixed by a 0-padded 10-digit id? (See the parsing check after this list.)
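
As a quick sanity check on that parsing, here is what the lambda produces for a hypothetical sample line (the id '0000000042' and the text are made up for illustration; note that labels must be a list):

from gensim.models.doc2vec import LabeledSentence

line = '0000000042the quick brown fox'  # hypothetical input line
ls = LabeledSentence(words=line[10:].split(), labels=[line[:10]])
print ls.words   # ['the', 'quick', 'brown', 'fox']
print ls.labels  # ['0000000042']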

Any suggestions/contributions are very much appreciated. I will write up a blog post summarizing the results, mentioning the contributors here, to potentially help others train Doc2Vec models on scaled distributed systems without spending much dev time solving what I am solving now.

Thanks


Update 06/13/2018

My apologies, as I never got to implement this. But there are better options nowadays, and DeepDist hasn't been maintained for a while now. Please read the comments below.

If you insist on trying out my idea at the moment, be reminded that you are proceeding at your own risk. Also, if someone knows that DeepDist still works, please report back in the comments. It would help other readers.

asked Feb 25 '16 by Patrick the Cat



1 Answer

To keep this question from showing as open, here is how the asker resolved the situation:

I did not get to implement this, and by the time I got back to it I no longer thought it would work. DeepDist uses a Flask app in the backend to interact with Spark's web interface. Since it is no longer maintained, Spark updates have very likely broken it already. If you are looking for Doc2Vec training on Spark, just go for Deeplearning4j (deeplearning4j.org/doc2vec#).

answered Oct 20 '22 by Dennis Jaheruddin