I have downloaded the pretrained Google News vector file, which was trained on roughly 100 billion words. On top of that, I am also training on my own 3 GB of data, which produces another pretrained vector file. Both have 300 feature dimensions and are more than 1 GB in size.
How do I merge these two huge sets of pretrained vectors? Or how do I train a new model and update its vectors on top of another? I see that the C-based word2vec does not support batch training.
I want to compute word analogies using both models. I believe that vectors learned from these two sources will produce pretty good results.
The word2vec model can create numeric vector representations of words from the training text corpus that maintain their semantic and syntactic relationships. A famous example of how word2vec preserves semantics: if you subtract the word Man from King and add Woman, it gives you Queen as one of the closest results.
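For instance, once the pretrained Google News vectors are loaded through gensim (the file name and the limit argument below are just illustrative), that analogy can be checked directly:

from gensim.models import KeyedVectors

# Load the binary Google News vectors; `limit` keeps memory usage manageable.
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                       binary=True, limit=500000)

# king - man + woman ~= queen
print(kv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))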
Applying Word Embeddings: after training the word2vec model, we can use the cosine similarity of word vectors from the trained model to find the words in the dictionary that are most semantically similar to an input word.

def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data                     # the full embedding matrix
    x = W[vocab[query_token]]                 # vector of the query word; vocab maps token -> index
    # Compute the cosine similarity. Add 1e-9 for numerical stability.
    cos = torch.mv(W, x) / torch.sqrt(torch.sum(W * W, dim=1) * torch.sum(x * x) + 1e-9)
    topk = torch.topk(cos, k=k + 1)[1]        # k+1 because the query token itself ranks first
    return [(float(cos[i]), vocab.to_tokens(int(i))) for i in topk[1:]]
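In the tutorial this snippet comes from, the function is called on the embedding layer that was just trained, roughly like this (net and vocab come from that surrounding training code, so adapt the names to your own):

get_similar_tokens('chip', 3, net[0])   # net[0] is the trained nn.Embedding layer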
Let's call the word2vec vector model W and the GloVe model G. Now, an embedding is just a vector, and W is a vector space. These two sets of embeddings live in different vector spaces, so you need to align the two vector spaces, as in this paper by Mikolov.
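A minimal sketch of that alignment, assuming both models are loaded as gensim KeyedVectors (the variable names are mine, and this uses an orthogonal Procrustes fit, a common stand-in for the linear projection learned in that paper):

import numpy as np

def align_spaces(src_kv, tgt_kv):
    # Words present in both models anchor the mapping (gensim 4.x attribute names assumed).
    shared = [w for w in src_kv.key_to_index if w in tgt_kv.key_to_index]
    A = np.stack([src_kv[w] for w in shared])   # shared words in the source space
    B = np.stack([tgt_kv[w] for w in shared])   # the same words in the target space
    # Orthogonal Procrustes: find the rotation R minimizing ||A R - B||.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

With R = align_spaces(W, G), the product W[w] @ R lives in G's coordinate system, so cosine comparisons across the two models become meaningful.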
The word2vec architectures were proposed by a team of researchers led by Tomas Mikolov at Google in 2013.
There's no straightforward way to merge the end-results of separate training sessions.
Even for the exact same data, slight randomization from initial seeding or thread scheduling jitter will result in diverse end states, making vectors only fully comparable within the same session.
This is because every session finds a useful configuration of vectors... but there are many equally useful configurations, rather than a single best.
For example, whatever final state you reach has many rotations/reflections that can be exactly as good on the training prediction task, or perform exactly as well on some other task (like analogies-solving). But most of these possible alternatives will not have coordinates that can be mixed-and-matched for useful comparisons against each other.
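A quick toy demonstration of that point (my own sketch, not from the answer): rotating a whole set of vectors with a random orthogonal matrix leaves every pairwise cosine similarity intact, even though none of the raw coordinates survive.

import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 300))                     # five fake 300-dimensional word vectors
Q, _ = np.linalg.qr(rng.normal(size=(300, 300)))  # a random orthogonal (rotation/reflection) matrix
V_rot = V @ Q                                     # the "same" model after rotation

def cosine_matrix(M):
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M @ M.T

print(np.allclose(cosine_matrix(V), cosine_matrix(V_rot)))  # True: every similarity is preserved
print(np.allclose(V, V_rot))                                # False: the coordinates no longer match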
Preloading your model with data from prior training runs might improve the results after more training with new data, but I'm not aware of any rigorous testing of this possibility. The effect likely depends on your specific goals, your parameter choices, and how much the new and old data are similar, or representative of the eventual data against which the vectors will be used.
For example, if the Google News corpus is unlike your own training data, or the text you'll be using the word-vectors to understand, using it as a starting point might just slow or bias your training. On the other hand, if you train on your new data long enough, eventually any influence of the preloaded values could be diluted to nothingness. (If you really wanted a 'blended' result, you might have to simultaneously train on the new data with an interleaved goal for nudging the vectors back towards the prior-dataset values.)
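If you still want to experiment with that preloading idea, one sketch using gensim (the file name, corpus variable, and gensim 4.x attribute names here are assumptions on my part) is to build the vocabulary from your own corpus, copy in the Google News vectors for overlapping words, and then train on the new data:

from gensim.models import KeyedVectors, Word2Vec

pretrained = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                               binary=True)

model = Word2Vec(vector_size=300, min_count=5, workers=4)
model.build_vocab(my_corpus)            # my_corpus: an iterable of tokenized sentences

# Seed shared words with the pretrained coordinates before training begins.
for word in model.wv.key_to_index:
    if word in pretrained.key_to_index:
        model.wv.vectors[model.wv.key_to_index[word]] = pretrained[word]

model.train(my_corpus, total_examples=model.corpus_count, epochs=model.epochs)

As noted above, whether this helps or hurts depends on how well the Google News text matches your own data, and long enough training on the new data can wash the preloaded values out.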
Ways to combine the results from independent sessions might make a good research project. Maybe the method used in the word2vec language-translation projects – learning a projection between vocabulary spaces – could also 'translate' between the different coordinates of different runs. Maybe locking some vectors in place, or training on the dual goals of 'predict the new text' and 'stay close to the old vectors' would give meaningfully improved combined results.
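As a very rough illustration of that 'translate, then combine' idea (reusing the align_spaces sketch above, with no claim that the blend is well-tested), one could map one model into the other's coordinates and average the vectors for words they share:

import numpy as np

def blend(src_kv, tgt_kv, R):
    # R = align_spaces(src_kv, tgt_kv); words that appear only in src_kv are skipped in this sketch.
    combined = {}
    for w in tgt_kv.key_to_index:
        if w in src_kv.key_to_index:
            combined[w] = (src_kv[w] @ R + tgt_kv[w]) / 2.0
        else:
            combined[w] = np.asarray(tgt_kv[w])
    return combined

Whether such a blend actually improves analogy results is exactly the kind of open question described above.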