 

word2vec gensim multiple languages

This problem is going completely over my head. I am training a Word2Vec model using gensim. I have provided data in multiple languages, i.e. English and Hindi. When I try to find the words closest to 'man', this is what I get:

model.wv.most_similar(positive = ['man'])
Out[14]: 
[('woman', 0.7380284070968628),
 ('lady', 0.6933152675628662),
 ('monk', 0.6662989258766174),
 ('guy', 0.6513140201568604),
 ('soldier', 0.6491742134094238),
 ('priest', 0.6440571546554565),
 ('farmer', 0.6366012692451477),
 ('sailor', 0.6297377943992615),
 ('knight', 0.6290514469146729),
 ('person', 0.6288090944290161)]

The problem is, these are all English words. I then tried to find the similarity between Hindi and English words that have the same meaning:

model.similarity('man', 'आदमी')
__main__:1: DeprecationWarning: Call to deprecated `similarity` (Method will 
be removed in 4.0.0, use self.wv.similarity() instead).
Out[13]: 0.078265618974427215

This similarity should have been higher than all of the scores above. The Hindi corpus I have was made by translating the English one, so the words appear in similar contexts and should therefore be close.

This is what I am doing here:

import multiprocessing
from gensim.models import Word2Vec

# Combining all the words together.
all_reviews = HindiWordsList + EnglishWordsList

# Training the Word2Vec model.
cpu_count = multiprocessing.cpu_count()
model = Word2Vec(size=300, window=5, min_count=1, alpha=0.025,
                 workers=cpu_count, max_vocab_size=None, negative=10)
model.build_vocab(all_reviews)
model.train(all_reviews, total_examples=model.corpus_count, epochs=model.iter)
model.save("word2vec_combined_50.bin")
2 Answers

I have been dealing with a very similar problem and came across a reasonably robust solution. This paper shows that a linear relationship can be defined between two Word2Vec models that have been trained on different languages. That means you can derive a translation matrix which converts word embeddings from one language model into the vector space of the other. In other words, you can take a word from one language and find words in the other language that have a similar meaning.
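If you want to see the idea without the package, here is a rough sketch (my own illustration with made-up helper names, not code from the paper or from transvec) of fitting a linear translation matrix by least squares, assuming both models are gensim KeyedVectors and word_pairs is a list of (source_word, target_word) translations:

import numpy as np

def fit_translation_matrix(src_kv, tgt_kv, word_pairs):
    # Least-squares fit of W such that src_vector @ W is close to tgt_vector.
    X = np.array([src_kv[src] for src, tgt in word_pairs])  # source-language vectors
    Y = np.array([tgt_kv[tgt] for src, tgt in word_pairs])  # target-language vectors
    W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def translate(word, src_kv, tgt_kv, W, topn=5):
    # Map a source-language word into the target space and look up its neighbours there.
    return tgt_kv.similar_by_vector(src_kv[word] @ W, topn=topn)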

I've written a small Python package that implements this for you: transvec. Here's an example where I use pre-trained models to search for Russian words and find English words with a similar meaning:

import gensim.downloader
from transvec.transformers import TranslationWordVectorizer

# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")

# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
    ("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
    ("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]

bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)

# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]

Don't worry about the weird POS tags on the Russian words - this is just a quirk of the particular pre-trained model I used.

So basically, if you can provide a list of words with their translations, then you can train a TranslationWordVectorizer to translate any word that exists in your source language corpus into the target language. When I used this for real, I produced some training data by extracting all the individual Russian words from my data, running them through Google Translate and then keeping everything that translated to a single word in English. The results were pretty good (sorry I don't have any more detail for the benchmark yet; it's still a work in progress!).
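For what it's worth, that filtering step can be as simple as the following sketch, where translate_to_english stands in for whatever translation API or lookup you use (it is a hypothetical placeholder, not part of transvec):

def build_training_pairs(source_words, translate_to_english):
    # Keep only words whose translation is a single English token,
    # and emit pairs in the (target, source) order used in the fit() example above.
    pairs = []
    for word in source_words:
        en = translate_to_english(word).strip().lower()
        if len(en.split()) == 1:
            pairs.append((en, word))
    return pairs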


First of all, you should really be using model.wv.similarity() rather than the deprecated model.similarity().
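That is, call it through the wv attribute, exactly as the deprecation warning suggests:

model.wv.similarity('man', 'आदमी')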

I'm assuming there are almost no words that appear in both your Hindi corpus and your English corpus, since the Hindi corpus is in Devanagari and the English one is in, well, English. Simply concatenating the two corpora to train one model does not make sense: corresponding words in the two languages co-occur across the two versions of a document, but never within the same training sentences, so Word2Vec has no shared contexts from which to work out that they are similar.

E.g. until your model learns from the word embeddings that

Man:Aadmi::Woman:Aurat,

it can never work out the

Raja:King::Rani:Queen

relation. For that, you need some anchor between the two corpora. Here are a few suggestions you can try:

  1. Make an independent Hindi corpus/model.
  2. Maintain a lookup table of a few English->Hindi word pairs that you will have to create manually.
  3. Randomly replace input document words with their counterparts from the corresponding document while training (see the sketch after this list).
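Here is a minimal sketch of the third suggestion; it assumes EnglishWordsList is a list of tokenised documents (as in your training code) and that en_to_hi is a small English->Hindi dictionary you have built yourself (a hypothetical name, not something gensim provides):

import random

def mix_corpora(english_docs, en_to_hi, swap_prob=0.3):
    # Randomly swap English tokens for their Hindi counterparts so the two
    # vocabularies start sharing contexts in the combined training data.
    mixed = []
    for doc in english_docs:
        mixed.append([
            en_to_hi[tok] if tok in en_to_hi and random.random() < swap_prob else tok
            for tok in doc
        ])
    return mixed

all_reviews = HindiWordsList + mix_corpora(EnglishWordsList, en_to_hi)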

These might be enough to give you an idea. You can also look into seq2seq if you only want to do translations, and you can read the Word2Vec theory in detail to understand what it does.


