How to fetch vectors for a word list with Word2Vec?

I want to create a text file that is essentially a dictionary, with each word paired with its vector representation from word2vec. I'm assuming the process would be to first train word2vec, then look up each word from my list to find its representation, and finally save the results in a new text file?

I'm new to word2vec and I don't know how to go about doing this. I've read several of the main sites and several of the questions on Stack Overflow, and haven't found a good tutorial yet.

803

asked Jul 15 '15 20:07

jonbon

1 Answer

Direct access via model[word] is deprecated in Gensim 3.x and no longer works as of Gensim 4.0.0, which separates the trained model from its word vectors (the KeyedVectors object at model.wv). Use model.wv[word] instead.

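Using Gensim in Python, once the vocabulary has been built and the model trained, the word counts and sampling information are already mapped in model.wv.vocab, where model is the variable name of your Word2Vec object. (In Gensim 4.x, model.wv.vocab was replaced by model.wv.key_to_index, which maps each word to its index.)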

Thus, to create a dictionary object, you can do the following:

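# Map every vocabulary word to its vector (Gensim 3.x API).
# On Gensim 4.x, iterate over model.wv.key_to_index instead of model.wv.vocab.
my_dict = {}
for key in model.wv.vocab:
    my_dict[key] = model.wv[key]
    # Or my_dict[key] = model.wv.get_vector(key)
    # Or my_dict[key] = model.wv.word_vec(key, use_norm=False)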
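
The loop above covers the whole vocabulary. Since the question asks about a specific word list, here is a minimal sketch of that case. The helper names vocabulary_words and vectors_for_words are my own, not part of Gensim; the only assumptions about the wv argument are that it supports `word in wv` and `wv[word]` (as model.wv does) and that it exposes either key_to_index (Gensim 4.x) or vocab (Gensim 3.x).

```python
def vocabulary_words(wv):
    """Return the vocabulary as a list on either Gensim 3.x or 4.x."""
    if hasattr(wv, "key_to_index"):  # Gensim 4.x attribute
        return list(wv.key_to_index)
    return list(wv.vocab)  # Gensim 3.x attribute


def vectors_for_words(wv, words):
    """Look up each word in `words`.

    Returns (found, missing): `found` maps word -> vector for words the
    model knows, `missing` lists the words that are not in the vocabulary.
    """
    found, missing = {}, []
    for word in words:
        if word in wv:
            found[word] = wv[word]
        else:
            missing.append(word)
    return found, missing
```

With a trained model you would call it as my_dict, unknown = vectors_for_words(model.wv, my_word_list). Collecting the unknown words separately avoids the KeyError that model.wv[word] raises for out-of-vocabulary words.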

Now that you have your dictionary, you can write it to a file by whatever means you like. For example, you can use the pickle library. Alternatively, if you are using Jupyter Notebook, it has a convenient 'magic command': %store my_dict > filename.txt. Your filename.txt will look like:

{'one': array([-0.06590105,  0.01573388,  0.00682817,  0.53970253, -0.20303348,
   -0.24792041,  0.08682659, -0.45504045,  0.89248925,  0.0655603 ,
   ......
   -0.8175681 ,  0.27659689,  0.22305458,  0.39095637,  0.43375066,
    0.36215973,  0.4040089 , -0.72396156,  0.3385369 , -0.600869  ],
  dtype=float32),
 'two': array([ 0.04694849,  0.13303463, -0.12208422,  0.02010536,  0.05969441,
   -0.04734801, -0.08465996,  0.10344813,  0.03990637,  0.07126121,
    ......
    0.31673026,  0.22282903, -0.18084198, -0.07555179,  0.22873943,
   -0.72985399, -0.05103955, -0.10911274, -0.27275378,  0.01439812],
  dtype=float32),
 'three': array([-0.21048863,  0.4945509 , -0.15050395, -0.29089224, -0.29454648,
    0.3420335 , -0.3419629 ,  0.87303966,  0.21656844, -0.07530259,
    ......
   -0.80034876,  0.02006451,  0.5299498 , -0.6286509 , -0.6182588 ,
   -1.0569025 ,  0.4557548 ,  0.4697938 ,  0.8928275 , -0.7877308 ],
  dtype=float32),
  'four': ......
}
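
If you would rather have a plain text file with one word per line, which is easier to parse than the printed dictionary shown above, something along these lines should work. This is only a sketch: write_vectors_txt and the six-decimal formatting are my own choices, and it assumes the dictionary maps each word to a sequence of numbers (a NumPy array or a list).

```python
def write_vectors_txt(path, vectors):
    """Write one 'word v1 v2 ...' line per entry; return the number of lines."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for word, vec in vectors.items():
            # float() accepts both Python floats and NumPy scalars.
            values = " ".join(f"{float(x):.6f}" for x in vec)
            f.write(f"{word} {values}\n")
            count += 1
    return count
```

For example, write_vectors_txt("filename.txt", my_dict) writes the dictionary built earlier. The file is human-readable, but it stores rounded values, so use pickle if you need the exact float32 values back.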

You may also wish to look into the native save / load methods of Gensim's word2vec.
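
For reference, the native route is model.save(path) with Word2Vec.load(path) for the full model, or model.wv.save_word2vec_format(path, binary=False) for a text file of vectors, which KeyedVectors.load_word2vec_format can read back. That text format has a header line with the vocabulary size and vector dimension, followed by one line per word. If you want to read such a file without Gensim, a minimal parser might look like this (read_word2vec_text is a hypothetical helper name, and it assumes words contain no spaces):

```python
def read_word2vec_text(path):
    """Parse a word2vec text-format file into {word: [floats]}."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        # Header line: "<vocab_size> <vector_size>"
        vocab_size, dim = (int(n) for n in f.readline().split())
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                raise ValueError(f"unexpected line: {line!r}")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header count does not match number of rows")
    return vectors
```

This returns plain Python lists; wrap them in numpy.array if you need arrays.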

147

answered Sep 21 '22 13:09

Moobie

Tags: artificial-intelligence, machine-learning, nlp, word2vec
