word2vec is a open source tool by Google:
For each word it provides a vector of float values, what exactly do they represent?
There is also a paper on paragraph vector can anyone explain how they are using word2vec in order to obtain fixed length vector for a paragraph.
TLDR: Word2Vec is building word projections (embeddings) in a latent space of N dimensions, (N being the size of the word vectors obtained). The float values represents the coordinates of the words in this N dimensional space.
The major idea behind latent space projections, putting objects in a different and continuous dimensional space, is that your objects will have a representation (a vector) that has more interesting calculus characteristics than basic objects.
For words, what's useful is that you have a dense vector space which encodes similarity (i.e tree has a vector which is more similar to wood than from dancing). This opposes to classical sparse one-hot or "bag-of-word" encoding which treat each word as one dimension making them orthogonal by design (i.e tree,wood and dancing all have the same distance between them)
Word2Vec algorithms do this:
Imagine that you have a sentence:
The dog has to go ___ for a walk in the park.
You obviously want to fill the blank with the word "outside" but you could also have "out". The w2v algorithms are inspired by this idea. You'd like all words that fill in the blanks near, because they belong together - This is called the Distributional Hypothesis - Therefore the words "out" and "outside" will be closer together whereas a word like "carrot" would be farther away.
This is sort of the "intuition" behind word2vec. For a more theorical explanation of what's going on i'd suggest reading:
For paragraph vectors, the idea is the same as in w2v. Each paragraph can be represented by its words. Two models are presented in the paper.
Bits from the article:
The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. [...] The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph
For full understanding on how these vectors are built you'll need to learn how neural nets are built and how the backpropagation algorithm works. (i'd suggest starting by this video and Andrew NG's Coursera class)
NB: Softmax is just a fancy way of saying classification, each word in w2v algorithms is considered as a class. Hierarchical softmax/negative sampling are tricks to speed up softmax and handle a lot of classes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With