I'm trying to replicate Mikolov's PV-DM + PV-DBOW work. He says both algorithms should be used together to get better results, so I'm trying to train the combined model and then feed its document vectors to t-SNE.
Using Gensim's Doc2Vec I can get the document vectors of a single model with docvecs.vectors_docs, but the concatenated structure doesn't appear to expose the document vectors of the joint model. It still treats the two models as separate entities.
(I can see this in the variable explorer.)
I'm also using ConcatenatedDoc2Vec from gensim.
Can anyone help? Is there a way to get the document vectors (and their tags) from the new concatenated entity rather than from the individual models?
Be warned that many have tried to reproduce the reported 'Paragraph Vector' results using concatenated PV-DBOW and PV-DM+dm_concat vectors without success. (For example, Mikolov himself reports being unable to reproduce the exact numbers that he says co-author Le contributed to the paper.)
The ConcatenatedDoc2Vec class is just a thin wrapper to join two models you've already trained separately, for the purposes of vector-lookup-by-tag (__getitem__() indexed access) and combined inference. (It's a mere 10 lines of code.)
To make this post-training join sensible, those two models should have been trained with the exact same documents/tags in the exact same order.
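As a rough illustration of what such a join amounts to, here is a minimal sketch. It does not use gensim at all: the two "models" are hypothetical stand-ins (plain dicts mapping a tag to a vector), and the helper names `same_tags_same_order` and `concatenated_lookup` are made up for this example.

```python
import numpy as np

# Hypothetical stand-ins for two separately trained models.
# Each maps a document tag to that model's vector for the document.
dbow_vectors = {
    "doc_0": np.array([0.1, 0.2]),
    "doc_1": np.array([0.3, 0.4]),
}
dm_vectors = {
    "doc_0": np.array([1.0, 1.1, 1.2]),
    "doc_1": np.array([2.0, 2.1, 2.2]),
}


def same_tags_same_order(a, b):
    """Return True if both lookups hold the same tags in the same order."""
    return list(a) == list(b)


def concatenated_lookup(tag, models):
    """Look the tag up in every model and join the results end to end."""
    return np.concatenate([model[tag] for model in models])


aligned = same_tags_same_order(dbow_vectors, dm_vectors)
joined = concatenated_lookup("doc_1", [dbow_vectors, dm_vectors])
```

Here `joined` has 2 + 3 = 5 dimensions: the first model's vector for "doc_1" followed by the second model's vector for the same tag. If the two models had seen different tags, the lookup would either fail or pair up unrelated documents.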
So if you need a list of tags, ask either model separately.
If you need some other combination of the two models – such as a single large array holding all the concatenated vectors, one row per document – you'd have to construct that yourself, perhaps using numpy's hstack function on the two models' vector arrays.
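A minimal sketch of that do-it-yourself join follows. The two arrays are random stand-ins, not real model output; with trained models you would substitute each model's own array of document vectors (for example the `docvecs.vectors_docs` attribute mentioned in the question), and the names `dbow_matrix`, `dm_matrix` and `combined` are just illustrative.

```python
import numpy as np

# Stand-in data: 4 documents, 3 dimensions from one model and 5 from the other.
# With real models these would be the per-model arrays of document vectors,
# with rows in the same document order in both.
rng = np.random.default_rng(0)
dbow_matrix = rng.normal(size=(4, 3))
dm_matrix = rng.normal(size=(4, 5))

# Row i of each array must describe the same document, so at the very
# least the two arrays need the same number of rows.
if dbow_matrix.shape[0] != dm_matrix.shape[0]:
    raise ValueError("models hold a different number of document vectors")

# Join side by side: row i becomes [dbow vector i | dm vector i].
combined = np.hstack([dbow_matrix, dm_matrix])
```

The resulting `combined` array has one row per document and 3 + 5 = 8 columns, and it is this single array that you would hand to whatever t-SNE implementation you are using.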
You can see my notebook trying to reproduce some of the paper's results inside the gensim docs/notebooks directory, or view online at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb