I'm trying to compare my implementation of Doc2Vec (via tensorflow) with gensim's implementation. At least visually, the gensim one seems to perform better.
I ran the following code to train the gensim model, and the code below that for the tensorflow model. My questions are as follows:
Does the `window=5` parameter in gensim mean that I am using two words on either side to predict the middle one, or is it 5 on either side? The issue is that quite a few documents are shorter than length 10.

```python
model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=10,
                hs=0, min_count=2, workers=cores)
model.build_vocab(corpus)
epochs = 100
for i in range(epochs):
    model.train(corpus)
```
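To make the window question concrete, here is a minimal sketch (plain Python, no gensim required, with a hypothetical `context_for` helper) of what a symmetric window of 5 looks like: up to 5 words on each side of the target, truncated at the document edges rather than padded.

```python
def context_for(tokens, i, window=5):
    """Return the context words around position i, clipped to the document.

    A symmetric window of 5 means a full context spans 2*5 + 1 = 11 words;
    shorter documents simply yield smaller contexts.
    """
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

doc = ["a", "b", "c", "d"]      # shorter than 11 words
print(context_for(doc, 2))      # ['a', 'b', 'd'] -- truncated, not padded
```

This matches the behavior described in the answer below for documents smaller than the context size: the effective window simply shrinks.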
```python
batch_size = 512
embedding_size = 100   # Dimension of the embedding vector.
num_sampled = 10       # Number of negative examples to sample.

graph = tf.Graph()

with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_word_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_doc_dataset = tf.placeholder(tf.int32, shape=[batch_size // context_window])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size // context_window, 1])

    # The variables
    word_embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    doc_embeddings = tf.Variable(
        tf.random_uniform([len_docs, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(tf.truncated_normal(
        [vocabulary_size, (context_window + 1) * embedding_size],
        stddev=1.0 / np.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))

    ###########################
    # Model.
    ###########################
    # Look up embeddings for inputs and stack words side by side
    embed_words = tf.reshape(
        tf.nn.embedding_lookup(word_embeddings, train_word_dataset),
        shape=[batch_size // context_window, -1])
    embed_docs = tf.nn.embedding_lookup(doc_embeddings, train_doc_dataset)
    embed = tf.concat(1, [embed_words, embed_docs])

    # Compute the softmax loss, using a sample of the negative labels each time.
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
        softmax_weights, softmax_biases, embed, train_labels,
        num_sampled, vocabulary_size))

    # Optimizer.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
```
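To sanity-check the tensor shapes in the graph above, here is a small sketch of the shape bookkeeping in plain Python. Note that `context_window` is defined outside the snippet, so the value used here is purely an assumption for illustration.

```python
# Shape bookkeeping for the PV-DM concat graph above (plain Python sketch).
batch_size = 512
embedding_size = 100
context_window = 8    # hypothetical value; not shown in the original snippet

# One target prediction per group of context_window words.
num_predictions = batch_size // context_window                            # 64

# embed_words: context-word vectors stacked side by side per prediction.
embed_words_shape = (num_predictions, context_window * embedding_size)    # (64, 800)
# embed_docs: one document vector per prediction.
embed_docs_shape = (num_predictions, embedding_size)                      # (64, 100)
# embed: concat along axis 1; must match softmax_weights' second dimension,
# which is (context_window + 1) * embedding_size.
embed_shape = (num_predictions, (context_window + 1) * embedding_size)    # (64, 900)

print(embed_words_shape, embed_docs_shape, embed_shape)
```

The key invariant is that the concatenated `embed` width must equal `(context_window + 1) * embedding_size`, the second dimension of `softmax_weights`.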
Check out the jupyter notebook here (I have both models working and tested there). It still feels like the gensim model performs better in this initial analysis.
Old question, but an answer would be useful for future visitors. So here are some of my thoughts.
There are some problems in the tensorflow implementation:

- `window` is the 1-side size, so `window=5` would be `5*2+1 = 11` words.
- `batch_size` would be the number of documents. So the `train_word_dataset` shape would be `batch_size * context_window`, while the `train_doc_dataset` and `train_labels` shapes would be `batch_size`.
- `sampled_softmax_loss` is not `negative_sampling_loss`. They are two different approximations of `softmax_loss`.

So for the OP's listed questions:

1. `doc2vec` in tensorflow is working and correct in its own way, but it is different from both the gensim implementation and the paper.
2. `window` is the 1-side size, as said above. If the document size is less than the context size, the smaller one is used.
3. The gensim implementation is faster. First, gensim was heavily optimized; all operations are faster than naive Python operations, especially the data I/O. Second, some preprocessing steps, such as `min_count` filtering in gensim, reduce the dataset size. More importantly, gensim uses `negative_sampling_loss`, which is much faster than `sampled_softmax_loss`; I guess this is the main reason.
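To illustrate the difference between the two losses, here is a simplified sketch in plain Python for a single true class and a handful of sampled negatives. It is a sketch of the standard textbook forms, not gensim's or TensorFlow's actual implementation (in particular, the log-expected-count correction that `tf.nn.sampled_softmax_loss` applies to the sampled logits is omitted).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def negative_sampling_loss(pos_score, neg_scores):
    # A sum of independent binary logistic losses: push the true pair's
    # score up and each sampled pair's score down. No normalization over
    # classes, which is what makes it cheap.
    return -math.log(sigmoid(pos_score)) - sum(
        math.log(sigmoid(-s)) for s in neg_scores)

def sampled_softmax_loss(pos_score, neg_scores):
    # A softmax restricted to the true class plus the sampled classes:
    # the denominator couples all scores, unlike negative sampling.
    denom = math.exp(pos_score) + sum(math.exp(s) for s in neg_scores)
    return -math.log(math.exp(pos_score) / denom)

pos, negs = 2.0, [-1.0, 0.5, -0.3]
print(negative_sampling_loss(pos, negs))
print(sampled_softmax_loss(pos, negs))
```

Both approximate the full softmax loss, but they are different objectives with different gradients, so the two models here are not training the same thing even with identical architectures.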