I am using gensim doc2vec. I want to know if there is an efficient way to get the vocabulary size from doc2vec. One crude way is to count the total number of words, but if the data is huge (1 GB or more), this won't be efficient.
With a vector size of 100, the vector maps the document to a point in 100-dimensional space. A size of 200 would map a document to a point in 200-dimensional space. More dimensions give the model more room to differentiate between documents.
Doc2Vec is another widely used technique that creates an embedding of a document irrespective of its length. While Word2Vec computes a feature vector for every word in the corpus, Doc2Vec computes a feature vector for every document in the corpus.
If model is your trained Doc2Vec model, then the number of unique word tokens in the surviving vocabulary after applying your min_count is available from:
len(model.wv.vocab)
The number of trained document tags is available from:
len(model.docvecs)
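Note that gensim 4.0 removed wv.vocab in favour of wv.key_to_index and renamed docvecs to dv, so the two lines above only work as written on older releases. A minimal sketch of a version-tolerant lookup follows. The helper names vocab_size and doc_tag_count are hypothetical, not part of gensim, and the SimpleNamespace objects are stand-ins for a trained model so the snippet runs without any training data:

```python
from types import SimpleNamespace


def vocab_size(model):
    """Count unique words kept in the model's vocabulary (after min_count)."""
    wv = model.wv
    # gensim 4.x exposes key_to_index; older releases expose vocab.
    mapping = getattr(wv, "key_to_index", None)
    if mapping is None:
        mapping = wv.vocab
    return len(mapping)


def doc_tag_count(model):
    """Count trained document tags."""
    # gensim 4.x uses model.dv; older releases use model.docvecs.
    docs = getattr(model, "dv", None)
    if docs is None:
        docs = model.docvecs
    return len(docs)


# Stand-ins that mimic only the attributes used above (not real models).
old_style = SimpleNamespace(
    wv=SimpleNamespace(vocab={"cat": 0, "dog": 1}),
    docvecs=["doc0", "doc1", "doc2"],
)
new_style = SimpleNamespace(
    wv=SimpleNamespace(key_to_index={"cat": 0, "dog": 1}),
    dv=["doc0", "doc1", "doc2"],
)

print(vocab_size(old_style), doc_tag_count(old_style))
print(vocab_size(new_style), doc_tag_count(new_style))
```

With a real model you would pass the trained Doc2Vec object instead of the stand-ins. Either way the count comes from a structure the model already holds, so the corpus is not read again.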
vocab is a dictionary keyed by word. Use keys() to get the words:
model.wv.vocab.keys()
In Python 2 this returns a list of words. In Python 3 it returns a dictionary view, so wrap it in list() if you need an actual list.
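As a small illustration of the dictionary behaviour in Python 3, here is a sketch that uses a plain dict as a hypothetical stand-in for the vocabulary mapping (the words and values are made up):

```python
# Hypothetical stand-in for the model's vocabulary dictionary.
vocab = {"cat": 0, "dog": 1, "fish": 2}

words_view = vocab.keys()  # a dictionary view in Python 3, not a list
words = list(vocab)        # materialise the words as a real list

print(len(vocab))  # len() on a dict does not iterate over the entries
print(words)
```

If you only need the vocabulary size, taking len() of the dictionary directly avoids building the list at all.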