I have some sample sentences that I want to run through a Doc2Vec model. My end goal is a matrix of size (num_sentences, num_features).
I'm using the Gensim package.
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
# warning: long sample of data. It's just 40 sentences really though.
labeled_sents = [TaggedDocument(words=['u0644', 'u0646', 'u062f', 'u0646', 'u060c', 'u0628', 'u0631', 'u0637', 'u0627', 'u0646', 'u06cc', 'u06c1', 'u06a9', 'u0627'], tags='400'), TaggedDocument(words=['do', 'pan', 'en', '1713', 'o', 'soar', 'onde', 'se', 'sit', 'xfaa'], tags='401'), TaggedDocument(words=['u0420', 'u044c', 'u043e', 'u043d', 'u0442', 'u0433', 'u0435', 'u043d', '1901', 'xa0', 'u2022', 'u041b', 'u043e', 'u0440', 'u0435', 'u043d', 'u0446', 'xa0', 'u0417', 'u0435', 'u0435', 'u043c', 'u0430', 'u043d', '1902', 'xa0', 'u2022', 'u0411', 'u0435', 'u043a', 'u0435', 'u0440', 'u0435', 'u043b', 'xa0', 'u041f', 'u0438', 'u0435', 'u0440', 'u041a', 'u044e', 'u0440', 'u0438', 'xa0', 'u041c', 'u0430', 'u0440', 'u0438', 'u044f', 'u041a', 'u044e', 'u0440', 'u0438', '1903', 'xa0', 'u2022', 'u0420', 'u0435', 'u043b', 'u0435', 'u0439', '1904', 'xa0', 'u2022', 'u041b', 'u0435', 'u043d', 'u0430', 'u0440', 'u0434', '1905', 'xa0', 'u2022', 'u0414', 'u0436', 'u0414', 'u0436', 'u0422', 'u043e', 'u043c', 'u0441', 'u044a', 'u043d', '1906', 'xa0', 'u2022', 'u041c', 'u0430', 'u0439', 'u043a', 'u0435', 'u043b', 'u0441', 'u044a', 'u043d', '1907', 'xa0', 'u2022', 'u041b', 'u0438', 'u043f', 'u043c', 'u0430', 'u043d', '1908', 'xa0', 'u2022', 'u041c', 'u0430', 'u0440', 'u043a', 'u043e', 'u043d', 'u0438', 'xa0', 'u0411', 'u0440', 'u0430', 'u0443', 'u043d', '1909', 'xa0', 'u2022', 'u0412', 'u0430', 'u043d', 'xa0', 'u0434', 'u0435', 'u0440', 'xa0', 'u0412', 'u0430', 'u0430', 'u043b', 'u0441', '1910', 'xa0', 'u2022', 'u0412', 'u0438', 'u043d', '1911', 'xa0', 'u2022', 'u0414', 'u0430', 'u043b', 'u0435', 'u043d', '1912', 'xa0', 'u2022', 'u041a', 'u0430', 'u043c', 'u0435', 'u0440', 'u043b', 'u0438', 'u043d', 'u0433', 'xa0', 'u041e', 'u043d', 'u0435', 'u0441', '1913', 'xa0', 'u2022', 'u0424', 'u043e', 'u043d', 'xa0', 'u041b', 'u0430', 'u0443', 'u0435', '1914', 'xa0', 'u2022', 'u0423', 'u0438', 'u043b', 'u044f', 'u043c', 'u041b', 'u0411', 'u0440', 'u0430', 'u0433', 'xa0', 'u0423', 'u0438', 
'u043b', 'u044f', 'u043c', 'u0425', 'u0411', 'u0440', 'u0430', 'u0433', '1915', 'xa0', 'u2022', 'u0411', 'u0430', 'u0440', 'u043a', 'u043b', 'u0430', '1917', 'xa0', 'u2022', 'u041f', 'u043b', 'u0430', 'u043d', 'u043a', '1918', 'xa0', 'u2022', 'u0429', 'u0430', 'u0440', 'u043a', '1919'], tags='402'), TaggedDocument(words=['nagusia', 'da'], tags='403'), TaggedDocument(words=['sino', 'que', 'los', 'ciudadanos', 'pueden', 'elegir', 'detraer', 'un', 'porcentaje', 'de', 'sus', 'impuestos', 'para', 'esta', 'causa', '68', '69', 'un', 'sistema', 'similar', 'se', 'da', 'en', 'alemania', 'o', 'austria', 'aunque', 'all', 'xed', 'se', 'impone', 'un', 'impuesto', 'eclesi', 'xe1stico'], tags='404'), TaggedDocument(words=['1244', 'c', 'xfc'], tags='405'), TaggedDocument(words=['u062a', 'u063a', 'u064a', 'u064a', 'u0631', 'u0644', 'u0641', 'u0638', 'u0627', 'u0644', 'u0643', 'u0644', 'u0645', 'u0629', 'u060c', 'u0641', 'u0645', 'u062b', 'u0644', 'u0627', 'u064b', 'rat', 'u062a', 'u0644', 'u0641', 'u0638', 'u0631', 'u0627', 'u062a'], tags='406'), TaggedDocument(words=['d', 'xfcrziler'], tags='407'), TaggedDocument(words=['xung', 'quanh', 'u0111', 'xf3'], tags='408'), TaggedDocument(words=['oblika', 'u0161to'], tags='409'), TaggedDocument(words=['u0432', 'u0430', 'u043b', 'u044e', 'u0442', 'u043d', 'u043e', 'u0433', 'u043e', 'u0441', 'u043e', 'u044e', 'u0437', 'u0443'], tags='410'), TaggedDocument(words=['sacerdotal', 'es'], tags='411'), TaggedDocument(words=['natoque', 'nisi'], tags='412'), TaggedDocument(words=['u0631', 'u0627', 'u0645', 'u06cc', 'u200c', 'u062a', 'u0648', 'u0627', 'u0646', 'u062f', 'u0631', 'u0627', 'u06cc', 'u0627', 'u0644', 'u0627', 'u062a', 'u0645', 'u062a', 'u062d', 'u062f', 'u0647', 'u0622', 'u0645', 'u0631', 'u06cc', 'u06a9', 'u0627', 'u06a9', 'u0627', 'u0646', 'u0627', 'u062f', 'u0627', 'u0628', 'u0631', 'u0632', 'u06cc', 'u0644', 'u0648', 'u0622', 'u0631', 'u0698', 'u0627', 'u0646', 'u062a', 'u06cc', 'u0646'], tags='413'), TaggedDocument(words=['u0423', 
'u0439', 'u0433', 'u0443', 'u0440', 'u0441', 'u044c', 'u043a', 'u0430', 'u043c', 'u043e', 'u0432', 'u0430'], tags='414'), TaggedDocument(words=['termin', 'poznat', 'kao'], tags='415'), TaggedDocument(words=['les', 'fr', 'xe8res', 'lumi', 'xe8re'], tags='416'), TaggedDocument(words=['26', 'u03c0', 'u03b5', 'u03c1', 'u03af', 'u03c0', 'u03bf', 'u03c5', 'u03b1', 'u03b9', 'u03ce', 'u03bd', 'u03b5', 'u03c2', 'u03b7', 'u03c0', 'u03cc', 'u03bb', 'u03b7', 'u03c4', 'u03b7', 'u03c2', 'u0391', 'u03c5', 'u03bb', 'u03ce', 'u03bd', 'u03b1', 'u03c2', 'u03b5', 'u03af', 'u03bd', 'u03b1', 'u03b9', 'u03c3', 'u03ae', 'u03bc', 'u03b5', 'u03c1', 'u03b1'], tags='417'), TaggedDocument(words=['xcen', '13'], tags='418'), TaggedDocument(words=['acts', 'of', 'civil', 'disobedience', 'forced', 'the', 'head', 'of', 'the', 'local'], tags='419'), TaggedDocument(words=['hugo', 'az', 'xe1llamcs', 'xedny'], tags='420'), TaggedDocument(words=['f', 'xf8rste', 'nu', 'uofficielle', 'vers', 'forbindes', 'ofte', 'med', 'nynazistiske', 'synspunkter'], tags='421'), TaggedDocument(words=['gisulti', 'kanila', 'sa', 'mga', 'langyaw', 'nagtuong', 'gipangutana', 'sila', 'kon'], tags='422'), TaggedDocument(words=['u043d', 'u0430', 'u0438', 'u0432', 'u0440', 'u0438', 'u0442'], tags='423'), TaggedDocument(words=['its', 'influence'], tags='424'), TaggedDocument(words=['a', 'b', 'azerbaijan', 'homeowners', 'evicted', 'for', 'city'], tags='425'), TaggedDocument(words=['dinast', 'xeda', 'lunar', 'de'], tags='426'), TaggedDocument(words=['2', 'wyznawa', 'u0142o', 'judaizmu', '5', 'ponad'], tags='427'), TaggedDocument(words=['quyosh', 'vaqt', 'degani'], tags='428'), TaggedDocument(words=['u306e', 'u884c', 'u4fe1', 'u30fb', 'u91cd', 'u5f18', 'u3001', 'u9678', 'u5965', 'u56fd', 'u306e', 'u821e', 'u8349', 'u6d3e', 'u3001', 'u51fa', 'u7fbd', 'u56fd', 'u306e', 'u6708', 'u5c71', 'u6d3e', 'u3001', 'u4f2f', 'u8006', 'u56fd', 'u306e', 'u5b89', 'u92fc', 'u6d3e', 'u3001', 'u5099', 'u4e2d', 'u56fd', 'u306e', 'u53e4', 'u9752', 
'u6c5f', 'u6d3e', 'u306e', 'u5b88', 'u6b21', 'u30fb', 'u6052', 'u6b21', 'u30fb', 'u5eb7', 'u6b21', 'u30fb', 'u8c9e', 'u6b21', 'u30fb', 'u52a9', 'u6b21', 'u30fb', 'u5bb6', 'u6b21', 'u30fb', 'u6b63', 'u6052', 'u3001', 'u8c4a', 'u5f8c', 'u56fd', 'u306e', 'u5b9a', 'u79c0', 'u6d3e', 'u3001', 'u85a9', 'u6469', 'u56fd', 'u306e', 'u53e4', 'u6ce2', 'u5e73', 'u6d3e', 'u306e', 'u884c', 'u5b89', 'u306a', 'u3069', 'u304c', 'u5b58', 'u5728', 'u3059', 'u308b', '7', '8', '9'], tags='429'), TaggedDocument(words=['p', 'xe5', '4'], tags='430'), TaggedDocument(words=['editovat'], tags='431'), TaggedDocument(words=['u0437', 'u0437', 'u0430', 'u0431', 'u043e', 'u0439', 'u0441', 'u0442', 'u0432', 'u0430', 'u043c', 'u0443'], tags='432'), TaggedDocument(words=['10', 'u043b', 'u0438', 'u043f', 'u043d', 'u044f', '1943', 'u0440', 'u043e', 'u043a', 'u0443', 'u0441', 'u043e', 'u044e', 'u0437', 'u043d', 'u0438', 'u043a', 'u0438', 'u0432', 'u0438', 'u0441', 'u0430', 'u0434', 'u0438', 'u043b', 'u0438', 'u0441', 'u044f', 'u0432', 'u0421', 'u0438', 'u0446', 'u0438', 'u043b', 'u0456', 'u0457', 'u0406', 'u0442', 'u0430', 'u043b', 'u0456', 'u0439', 'u0441', 'u044c', 'u043a', 'u0456'], tags='433'), TaggedDocument(words=['136', 'selvom', 'det', 'egentligt', 'ligger', 'i', 'sundby', 'p', 'xe5', 'lollandssiden', 'af', 'guldborgsund', 'centret', 'blev', 'grundlagt', 'i', '1989', 'da', 'byen', 'fejrede', '700', 'xe5rs', 'jubil', 'xe6um', 'bymuseet', 'rekonstruerede', 'som', 'de', 'f', 'xf8rste', 'i', 'verden', 'en', 'middelalderlig', 'kastemaskine', 'kaldet', 'en', 'blide'], tags='434'), TaggedDocument(words=['latine', 'redditur'], tags='435'), TaggedDocument(words=['ljubljani', 'in', 'njeni'], tags='436'), TaggedDocument(words=['u0442', 'u0430', 'u043d', 'u044b', 'u043c', 'u0430', 'u043b', 'u049b', 'u043e', 'u043d', 'u0430', 'u049b', 'u04af', 'u0439', 'u043b', 'u0435', 'u0440'], tags='437'), TaggedDocument(words=['u2022', 'hassib', 'ben'], tags='438'), TaggedDocument(words=['kurtulmu', 'u015f', 'olan', 
'u0130talya'], tags='439')]
model = Doc2Vec(documents=labeled_sents, size=10, alpha=.035, window=4,
sample=1e-5, workers=4, min_count=1)
Now, I thought that model.docvecs would give me a list of arrays, with the first array corresponding to the vector for sentence 1, the second array corresponding to the vector for sentence 2, etc. But instead, it's got length 10!
I get model.docvecs[0] = array([ 0.02312995, -0.00339695, -0.01273827, 0.01944644, -0.03247212, -0.04663946, 0.01369059, 0.03289782, 0.03516903, -0.03435936], dtype=float32)
What are these docvecs, then? How do I get the desired output, which is a matrix of dimensions (40, 10) in this example?
I saw this here, and the correct answer says at the bottom "where 99 is the document id whose vector we want." So this makes me even more confused, as he seems to say that model.docvecs SHOULD be indexing a matrix where each row is a document vector!
TaggedDocument expects tags to be a list of tags related to the document.
In your case,
sentence = TaggedDocument(words=['a', 'b'], tags='400')
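gets interpreted as a sentence with 3 tags, ['4', '0', '0'], because a string passed as tags is iterated character by character. Across your 40 documents (tags '400' to '439') the only distinct characters are the ten digits, so model.docvecs holds vectors for just 10 tags: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'].
Try changing this to: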
sentence = TaggedDocument(words=['a', 'b'], tags=['400'])
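To see why the count comes out as 10, here is a minimal sketch that runs without gensim. It uses a stand-in namedtuple with the same two fields as TaggedDocument, and a hypothetical helper, distinct_tags, that simply iterates over each document's tags. It only illustrates the string-versus-list behaviour; it is not gensim's actual code.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument (assumption: same two fields,
# words and tags), so this sketch runs without gensim installed.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def distinct_tags(documents):
    """Hypothetical helper: collect tags in first-seen order by iterating doc.tags."""
    seen = []
    for doc in documents:
        for tag in doc.tags:  # iterating a str yields its individual characters
            if tag not in seen:
                seen.append(tag)
    return seen


ids = [str(i) for i in range(400, 440)]  # '400' .. '439', as in the question

# tags given as a plain string (the problem) vs. a one-element list (the fix)
buggy = [TaggedDocument(words=["a", "b"], tags=i) for i in ids]
fixed = [TaggedDocument(words=["a", "b"], tags=[i]) for i in ids]

buggy_tags = distinct_tags(buggy)
fixed_tags = distinct_tags(fixed)

print(len(buggy_tags), sorted(buggy_tags))  # 10 distinct single-character tags
print(len(fixed_tags), fixed_tags[:3])      # 40 tags, one per document
```

With list tags, each document gets its own vector. Assuming a gensim version that exposes model.docvecs as in the question, looking up model.docvecs['400'] should return that document's vector, and stacking the lookups in tag order (for example with numpy.vstack) should give the (40, 10) matrix.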