
Understanding word2vec text representation [closed]

Tags:

nlp

word2vec

I would like to implement the distance part of word2vec in my program. Unfortunately my program is not written in C/C++ or Python, and I don't understand the non-binary representation. This is how I generated the file:

    ./word2vec -train text8-phrase -output vectorsphrase.txt -cbow 0 -size 300 -window 10 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 0

When I check the vectorsphrase.txt file for france, all I get is:

 france -0.062591 0.264201 0.236335 -0.072601 -0.094313 -0.202659 -0.373314 0.074684 -0.262307 0.139383 -0.053648 -0.154181 0.126962 0.432593 -0.039440 0.108096 0.083703 0.148991 0.062826 0.048151 0.005555 0.066885 0.004729 -0.013939 -0.043947 0.057280 -0.005259 -0.223302 0.065608 -0.013932 -0.199372 -0.054966 -0.026725 0.012510 0.076350 -0.027816 -0.187357 0.248191 -0.085087 0.172979 -0.116789 0.014136 0.131571 0.173892 0.316052 -0.045492 0.057584 0.028944 -0.193623 0.043965 -0.166696 0.111058 0.145268 -0.119645 0.091659 0.056593 0.417965 -0.002927 -0.081579 -0.021356 0.030447 0.052507 -0.109058 -0.011124 -0.136975 0.104396 0.069319 0.030266 -0.193283 -0.024614 -0.025636 -0.100761 0.032366 0.069175 0.200204 -0.042976 -0.045123 -0.090475 0.090071 -0.037075 0.182373 0.151529 0.080198 -0.024067 -0.196623 -0.204863 0.154429 -0.190242 -0.063265 -0.323000 -0.109863 0.102366 -0.085017 0.198042 -0.033342 0.119225 0.176891 0.214628 0.031771 0.168739 0.063246 -0.147353 -0.003526 0.138835 -0.172460 -0.133294 -0.369451 0.063572 0.076098 -0.116277 0.208374 0.015783 0.145032 0.090530 -0.090470 0.109325 0.119184 0.024838 0.101194 -0.184154 -0.161648 -0.039929 0.079321 0.029462 -0.016193 -0.005485 0.197576 -0.118860 0.019042 -0.137174 -0.047933 -0.008472 0.092360 0.165395 0.013253 -0.099013 -0.017355 -0.048332 -0.077228 0.034320 -0.067505 -0.050190 -0.320440 -0.040684 -0.106958 -0.169634 -0.014216 0.225693 0.345064 0.135220 -0.181518 -0.035400 -0.095907 -0.084446 0.025784 0.090736 -0.150824 -0.351817 0.174671 0.091944 -0.112423 -0.140281 0.059532 0.002152 0.127812 0.090834 -0.130366 -0.061899 -0.280557 0.076417 -0.065455 0.205525 0.081539 0.108110 0.013989 0.133481 -0.256035 -0.135460 0.127465 0.113008 0.176893 -0.018049 0.062527 0.093005 -0.078622 -0.109232 0.065856 0.138583 0.097186 -0.124459 0.011706 0.113376 0.024505 -0.147662 -0.118035 0.129616 0.114539 0.165050 -0.134871 -0.036298 -0.103552 -0.108726 0.025950 0.053895 -0.173731 0.201482 -0.198697 -0.339452 0.166154 
-0.014059 0.022529 0.212491 -0.051978 0.057627 0.198319 0.092990 -0.171989 -0.060376 0.084172 -0.034411 -0.065443 0.054581 -0.024187 0.072550 0.113017 0.080476 -0.170499 0.148091 -0.010503 0.158095 0.111080 0.007598 0.042551 -0.161005 -0.078712 0.318305 -0.011473 0.065593 0.121385 0.087249 -0.011178 0.053639 -0.100713 0.168689 0.120121 -0.058025 -0.161788 -0.101135 -0.080533 0.120502 -0.099477 0.187640 -0.054496 0.180532 -0.097961 0.049633 -0.019596 0.145623 0.284261 0.039761 0.053866 0.089779 -0.000676 -0.081653 0.082819 0.263937 -0.141818 0.011605 -0.028248 -0.020573 0.091329 -0.080264 -0.358647 -0.134252 0.115414 -0.066107 0.150770 -0.018897 0.168325 0.111375 -0.091567 -0.152783 -0.034834 -0.418656 -0.091504 -0.134671 0.051754 -0.129495 0.230855 -0.339259 0.208410 0.191621 0.007837 -0.016602 -0.131502 -0.059481 -0.185196 0.303028 0.017646 -0.047340 

So apart from these values I don't get anything else, and when I run the distance tool and type france I get:

Word Cosine distance

            spain              0.678515
          belgium              0.665923
      netherlands              0.652428
            italy              0.633130
      switzerland              0.622323
       luxembourg              0.610033
         portugal              0.577154
           russia              0.571507
          germany              0.563291
        catalonia              0.534176

So, from these probabilities, how do I link one word to other words, and how do I know which value belongs to which?

Asked by Sarp Kaya on Jan 22 '26 at 23:01


2 Answers

First of all, you should know that word2vec was made for research purposes. It generates a vector representation for each word: if you train a model with 50 dimensions, you will get 50 numbers in front of each word (like 'france'). For the sake of this example, imagine a 2-dimensional vector space and two words, 'france' and 'spain':

france -0.1 0.2
spain -0.3 0.15

The cosine similarity (they call it distance) is the normalized dot product of two vectors: the sum of the products of corresponding components, divided by the product of the lengths of both vectors. The cosine similarity of these two vectors is:

    ( (-0.1) * (-0.3) + (0.2) * (0.15) )
    / sqrt((-0.1) * (-0.1) + (0.2) * (0.2))
    / sqrt((-0.3) * (-0.3) + (0.15) * (0.15)) = 0.8

So, those numbers in the distance tool are not probabilities. (As a reminder, cosine similarity can be negative.) Intuitively, the similarity measure is the main thing you get when you want to relate two words. But Mikolov (the word2vec creator) showed that word2vec models actually capture other semantic relations through simple vector operations (see the paper). If you are looking for such a tool in word2vec, it is called word-analogy: it returns a list of words for 3 given input words (the famous king - man + woman = queen).
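The worked calculation above can be sketched in plain Python; the `cosine_similarity` helper is just for illustration and is not part of the word2vec toolkit:

```python
import math

def cosine_similarity(a, b):
    # dot product: sum of products of corresponding components
    dot = sum(x * y for x, y in zip(a, b))
    # divide by the product of the two vector lengths (Euclidean norms)
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

france = [-0.1, 0.2]
spain = [-0.3, 0.15]
print(cosine_similarity(france, spain))  # approximately 0.8
```

Note that the result can range from -1 to 1, which is another way to see these are not probabilities.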

Answered by Mehdi on Jan 25 '26 at 13:01


I believe what you get for france is the vector associated with that word. Cosine similarity can be computed from those vectors as described at http://en.wikipedia.org/wiki/Cosine_similarity. As far as I know, these values have no probabilistic interpretation. You can retrieve semantically close words by setting a threshold on the cosine similarity of the vectors.
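As a sketch of that approach in Python, assuming the `-binary 0` text format shown in the question (a header line with vocabulary size and dimension, then one line per word followed by its components; the function names here are hypothetical):

```python
import math

def load_vectors(path):
    # Parse word2vec's -binary 0 text output into {word: vector}.
    vectors = {}
    with open(path) as f:
        f.readline()  # skip the "vocab_size dimension" header line
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:] if x]
    return vectors

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def neighbors(vectors, word, threshold=0.5):
    # Return (word, similarity) pairs above the threshold, best first.
    target = vectors[word]
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    return sorted((wc for wc in scored if wc[1] > threshold),
                  key=lambda wc: wc[1], reverse=True)
```

Calling `neighbors(load_vectors("vectorsphrase.txt"), "france")` would reproduce a list like the one the distance tool prints, modulo the exact threshold and ranking size.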

Answered by Luca Fiaschi on Jan 25 '26 at 13:01
