I'm trying to compare my implementation of Doc2Vec (via tensorflow) with gensim's implementation. At least visually, the gensim one seems to perform better.
I ran the following code to train the gensim model, and the code below that for the tensorflow model. My questions are as follows:
Does the `window=5` parameter in gensim mean that I am using two words on either side to predict the middle one, or is it 5 on either side? The issue is that quite a few documents are shorter than length 10.

```python
model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=10,
                hs=0, min_count=2, workers=cores)
model.build_vocab(corpus)
epochs = 100
for i in range(epochs):
    model.train(corpus)
```
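To make the window question concrete, here is a minimal sketch (plain Python, no gensim required, with a hypothetical `context_for` helper) of what a symmetric window of 5 looks like: up to 5 words on each side of the target, truncated at the document edges rather than padded.

```python
def context_for(tokens, i, window=5):
    """Return the context words around position i, clipped to the document.

    A symmetric window of 5 means a full context spans 2*5 + 1 = 11 words;
    shorter documents simply yield smaller contexts.
    """
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

doc = ["a", "b", "c", "d"]      # shorter than 11 words
print(context_for(doc, 2))      # ['a', 'b', 'd'] -- truncated, not padded
```

This matches the behavior described in the answer below for documents smaller than the context size: the effective window simply shrinks.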
```python
batch_size = 512
embedding_size = 100   # Dimension of the embedding vector.
num_sampled = 10       # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_word_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_doc_dataset = tf.placeholder(tf.int32, shape=[batch_size // context_window])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size // context_window, 1])

    # The variables
    word_embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    doc_embeddings = tf.Variable(
        tf.random_uniform([len_docs, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(tf.truncated_normal(
        [vocabulary_size, (context_window + 1) * embedding_size],
        stddev=1.0 / np.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    ###########################
    # Model.
    ###########################
    # Look up embeddings for inputs and stack words side by side
    embed_words = tf.reshape(
        tf.nn.embedding_lookup(word_embeddings, train_word_dataset),
        shape=[batch_size // context_window, -1])
    embed_docs = tf.nn.embedding_lookup(doc_embeddings, train_doc_dataset)
    embed = tf.concat(1, [embed_words, embed_docs])

    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
        softmax_weights, softmax_biases, embed, train_labels,
        num_sampled, vocabulary_size))

    # Optimizer.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
```
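To sanity-check the tensor shapes in the graph above, here is a small sketch of the shape bookkeeping in plain Python. Note that `context_window` is defined outside the snippet, so the value used here is purely an assumption for illustration.

```python
# Shape bookkeeping for the PV-DM concat graph above (plain Python sketch).
batch_size = 512
embedding_size = 100
context_window = 8    # hypothetical value; not shown in the original snippet

# One target prediction per group of context_window words.
num_predictions = batch_size // context_window                            # 64

# embed_words: context-word vectors stacked side by side per prediction.
embed_words_shape = (num_predictions, context_window * embedding_size)    # (64, 800)
# embed_docs: one document vector per prediction.
embed_docs_shape = (num_predictions, embedding_size)                      # (64, 100)
# embed: concat along axis 1; must match softmax_weights' second dimension,
# which is (context_window + 1) * embedding_size.
embed_shape = (num_predictions, (context_window + 1) * embedding_size)    # (64, 900)

print(embed_words_shape, embed_docs_shape, embed_shape)
```

The key invariant is that the concatenated `embed` width must equal `(context_window + 1) * embedding_size`, the second dimension of `softmax_weights`.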
Check out the jupyter notebook here (I have both models working and tested there). It still feels like the gensim model performs better in this initial analysis.
Old question, but an answer would be useful for future visitors. So here are some of my thoughts.
There are some problems in the tensorflow implementation:

- `window` is the 1-side size, so `window=5` would be `5*2+1 = 11` words.
- `batch_size` would be the number of documents. So the `train_word_dataset` shape would be `batch_size * context_window`, while the `train_doc_dataset` and `train_labels` shapes would be `batch_size`.
- `sampled_softmax_loss` is not `negative_sampling_loss`. They are two different approximations of `softmax_loss`.

So for the OP's listed questions:

1. `doc2vec` in tensorflow is working and correct in its own way, but it is different from both the gensim implementation and the paper.
2. `window` is the 1-side size, as said above. If the document size is less than the context size, the smaller one is used.
3. The gensim implementation is faster. First, gensim was heavily optimized; all operations are faster than naive Python operations, especially the data I/O. Second, some preprocessing steps, such as `min_count` filtering in gensim, reduce the dataset size. More importantly, gensim uses `negative_sampling_loss`, which is much faster than `sampled_softmax_loss`; I guess this is the main reason.
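To illustrate the difference between the two losses, here is a simplified sketch in plain Python for a single true class and a handful of sampled negatives. It is a sketch of the standard textbook forms, not gensim's or TensorFlow's actual implementation (in particular, the log-expected-count correction that `tf.nn.sampled_softmax_loss` applies to the sampled logits is omitted).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(pos_score, neg_scores):
    # A sum of independent binary logistic losses: push the true pair's
    # score up and each sampled pair's score down. No normalization over
    # classes, which is what makes it cheap.
    return -math.log(sigmoid(pos_score)) - sum(
        math.log(sigmoid(-s)) for s in neg_scores)

def sampled_softmax_loss(pos_score, neg_scores):
    # A softmax restricted to the true class plus the sampled classes:
    # the denominator couples all scores, unlike negative sampling.
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

pos, negs = 2.0, [-1.0, 0.5, -0.3]
print(negative_sampling_loss(pos, negs))
print(sampled_softmax_loss(pos, negs))
```

Both approximate the full softmax loss, but they are different objectives with different gradients, so the two models here are not training the same thing even with identical architectures.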