Is there any way to get the vocabulary size from doc2vec model?

Tags: gensim, word2vec, doc2vec

I am using gensim doc2vec. I want to know if there is an efficient way to get the vocabulary size from a doc2vec model. One crude way is to count the total number of words, but if the data is huge (1 GB or more), this won't be efficient.

asked Jan 12 '17 08:01

Rashmi Singh

2 Answers

If model is your trained Doc2Vec model, then the number of unique word tokens in the surviving vocabulary after applying your min_count is available from:

len(model.wv.vocab)

The number of trained document tags is available from:

len(model.docvecs)
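
Note that the attribute names above apply to gensim releases before 4.0. In gensim 4.0 and later, the vocabulary dictionary is exposed as model.wv.key_to_index and the document vectors as model.dv. A minimal sketch of helpers that work with either naming is shown below; the function names vocab_size and doc_tag_count are made up for illustration and are not part of gensim.

```python
def vocab_size(model):
    """Number of unique words kept in the model's vocabulary (after min_count)."""
    wv = model.wv
    # gensim >= 4.0: the vocabulary is the key_to_index dict
    if hasattr(wv, "key_to_index"):
        return len(wv.key_to_index)
    # gensim < 4.0: the vocabulary is the vocab dict
    return len(wv.vocab)


def doc_tag_count(model):
    """Number of trained document tags."""
    # gensim >= 4.0 names the document vectors dv; older releases use docvecs
    docs = model.dv if hasattr(model, "dv") else model.docvecs
    return len(docs)
```

Both helpers only read the size of structures the trained model already holds, so neither needs another pass over the training corpus.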

answered Sep 25 '22 02:09

gojomo

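vocab is a dictionary keyed by word. Use keys() as follows:

model.wv.vocab.keys()

This returns the words in the vocabulary (a list in Python 2, a dictionary view in Python 3). Pass the result to len() to get the vocabulary size.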

answered Sep 22 '22 02:09

Prometheus
