
Is there any way to get the vocabulary size from doc2vec model?

I am using gensim's doc2vec. Is there an efficient way to get the vocabulary size from a doc2vec model? One crude way is to count the unique words in the corpus myself, but if the data is huge (1 GB or more), that won't be efficient.

Rashmi Singh asked Jan 12 '17 08:01




2 Answers

If model is your trained Doc2Vec model, then the number of unique word tokens in the surviving vocabulary after applying your min_count is available from:

len(model.wv.vocab)

The number of trained document tags is available from:

len(model.docvecs)
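
A note for newer installs: Gensim 4.x removed `wv.vocab` in favour of `wv.key_to_index` and renamed `docvecs` to `dv`, so the two lines above may fail there. Below is a minimal sketch of version-tolerant helpers. The function names `vocab_size` and `doc_tag_count` are hypothetical, not part of gensim, and `fake_model` is only a stand-in object so the snippet runs without a trained model.

```python
from types import SimpleNamespace


def vocab_size(model):
    """Number of unique words that survived min_count filtering."""
    wv = model.wv
    # Gensim 4.x: the word -> index mapping lives in wv.key_to_index.
    key_to_index = getattr(wv, "key_to_index", None)
    if key_to_index is not None:
        return len(key_to_index)
    # Gensim 3.x: the vocabulary is the wv.vocab dictionary.
    return len(wv.vocab)


def doc_tag_count(model):
    """Number of trained document tags."""
    # Gensim 4.x calls the document vectors model.dv; 3.x calls them model.docvecs.
    dv = getattr(model, "dv", None)
    if dv is None:
        dv = model.docvecs
    return len(dv)


# Stand-in for a trained model (assumption: mimics only the attributes used above).
fake_model = SimpleNamespace(
    wv=SimpleNamespace(key_to_index={"human": 0, "interface": 1, "computer": 2}),
    dv=["doc_0", "doc_1"],
)

print(vocab_size(fake_model))
print(doc_tag_count(fake_model))
```

With a real trained Doc2Vec model, pass it in place of `fake_model`. Both helpers only read mappings the model already holds in memory, so neither re-scans the corpus.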

gojomo answered Sep 25 '22 02:09


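vocab is a dictionary keyed by word. To get the words themselves, use keys() as follows:

model.wv.vocab.keys()

This returns the words in the vocabulary. In Python 3 the result is a dictionary view rather than a list, so wrap it in list() if you need an actual list.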

Prometheus answered Sep 22 '22 02:09