Why are word embeddings actually vectors?

I am sorry for my naivety, but I don't understand why word embeddings, the result of an NN training process (word2vec), are actually vectors.

Embedding is a process of dimensionality reduction: during training, the NN reduces the 1/0 arrays of words into smaller arrays. The process does nothing that involves vector arithmetic.

So as a result we get just arrays, not vectors. Why should I think of these arrays as vectors?

Even if we do get vectors, why does everyone depict them as vectors coming from the origin (0,0)?

Again, I am sorry if my question looks stupid.

1 Answer

What are embeddings?

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension.

(Source: https://en.wikipedia.org/wiki/Word_embedding)
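
As a minimal sketch of that mapping (the vocabulary and the 2-dimensional embedding values below are made up for illustration):

>>> import numpy as np
>>> vocab = ['car', 'vehicle', 'apple', 'orange', 'fruit']
>>> # "one dimension per word": each word is a 1/0 array as long as the vocabulary
>>> one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}
>>> one_hot['apple']
array([0., 0., 1., 0., 0.])
>>> # the embedding maps each word to a much shorter array of real numbers
>>> embedding = {'car': [0.2, -1.3], 'vehicle': [0.3, -1.1], 'apple': [2.1, 0.4],
...              'orange': [2.0, 0.6], 'fruit': [1.8, 0.5]}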

What is Word2Vec?

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words.

Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.

Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

(Source: https://en.wikipedia.org/wiki/Word2vec)
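
For a rough idea of what that looks like in code, here is a minimal sketch assuming the gensim library (4.x API) and a toy corpus invented for illustration; the actual numbers and neighbours will vary from run to run:

>>> from gensim.models import Word2Vec
>>> sentences = [['apple', 'is', 'a', 'fruit'],
...              ['orange', 'is', 'a', 'fruit'],
...              ['samsung', 'makes', 'a', 'phone'],
...              ['apple', 'makes', 'a', 'phone']]
>>> model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, seed=0)
>>> vec = model.wv['apple']                  # a numpy array of 10 real numbers
>>> model.wv.most_similar('apple', topn=2)   # neighbours in the learned vector space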

What's an array?

In computer science, an array data structure, or simply an array, is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key.

An array is stored so that the position of each element can be computed from its index tuple by a mathematical formula.

The simplest type of data structure is a linear array, also called a one-dimensional array.

(Source: https://en.wikipedia.org/wiki/Array_data_structure)
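
For example, in Python (using a plain list as the one-dimensional array; the values are arbitrary):

>>> arr = [5, -20, 3]    # a linear (one-dimensional) array
>>> arr[1]               # each element is identified by its index
-20
>>> arr + arr            # note: a plain array has no vector arithmetic; + just concatenates
[5, -20, 3, 5, -20, 3]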

What's a vector / vector space?

A vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied ("scaled") by numbers, called scalars.

Scalars are often taken to be real numbers, but there are also vector spaces with scalar multiplication by complex numbers, rational numbers, or generally any field.

The operations of vector addition and scalar multiplication must satisfy certain requirements, called axioms.

(Source: https://en.wikipedia.org/wiki/Vector_space)
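
Concretely, with numpy arrays standing in for vectors (arbitrary toy values):

>>> import numpy as np
>>> u = np.array([2.0, -1.0])
>>> v = np.array([0.5, 3.0])
>>> u + v        # vector addition
array([2.5, 2. ])
>>> 3 * u        # scalar multiplication
array([ 6., -3.])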

What's the difference between vectors and arrays?

Firstly, the vector in word embeddings is not exactly the programming-language data structure (so it's not about Arrays vs Vectors: Introductory Similarities and Differences).

Programmatically, a word embedding vector IS some sort of array (data structure) of real numbers (i.e. scalars).

Mathematically, any element with one or more dimensions populated with real numbers is a tensor, and a vector is a single dimension of scalars.
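
In numpy terms (toy values):

>>> import numpy as np
>>> np.array(5.0).ndim                # a single scalar: 0 dimensions
0
>>> np.array([5.0, -20.0]).ndim       # a vector: 1 dimension of scalars
1
>>> np.array([[5.0, -20.0],
...           [2.0, -18.0]]).ndim     # a matrix: a 2-dimensional tensor
2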


To answer the OP's questions:

Why are word embeddings actually vectors?

By definition, word embeddings are vectors (see above).

Why do we represent words as vectors of real numbers?

To learn the differences between words, we have to quantify those differences in some manner.

Imagine if we assign these "smart" numbers to the words:

>>> semnum = semantic_numbers = {'car': 5, 'vehicle': 2, 'apple': 232, 'orange': 300, 'fruit': 211, 'samsung': 1080, 'iphone': 1200}
>>> abs(semnum['fruit'] - semnum['apple'])
21
>>> abs(semnum['samsung'] - semnum['apple'])
848

We see that the distance between fruit and apple is small, but between samsung and apple it isn't. In this case, a single numerical "feature" of the word is capable of capturing some information about the word meanings, but not fully.

Imagine that we have two real-number values for each word (i.e. a vector):

>>> import numpy as np
>>> semnum = semantic_numbers = {'car': [5, -20], 'vehicle': [2, -18], 'apple': [232, 1010], 'orange': [300, 250], 'fruit': [211, 250], 'samsung': [1080, 1002], 'iphone': [1200, 1100]}

To compute the difference, we could do:

>>> np.array(semnum['apple']) - np.array(semnum['orange'])
array([-68, 760])

>>> np.array(semnum['apple']) - np.array(semnum['samsung'])
array([-848,    8])

That's not very informative: it returns a vector, and we can't get a definitive measure of the distance between the words. So we can try some vectorial tricks and compute the distance between the vectors, e.g. the Euclidean distance:

>>> import numpy as np
>>> orange = np.array(semnum['orange'])
>>> apple = np.array(semnum['apple'])
>>> samsung = np.array(semnum['samsung'])

>>> np.linalg.norm(apple-orange)
763.03604108849277

>>> np.linalg.norm(apple-samsung)
848.03773500947466

>>> np.linalg.norm(orange-samsung)
1083.4685043876448

Now we can see more "information": apple is closer to samsung than orange is to samsung. Possibly that's because apple co-occurs in the corpus more frequently with samsung than orange does.

The big question is, "How do we get these real numbers to represent the vectors of the words?". That's where the Word2Vec / embedding training algorithms (originally conceived by Bengio et al., 2003) come in.


Taking a detour

Since adding more real numbers to the vector representing a word is more informative, why don't we just add a lot more dimensions (i.e. more columns in each word vector)?

Traditionally, we computed the differences between words by building word-by-word matrices in the field of distributional semantics/distributed lexical semantics, but the matrices become really sparse, with many zero values, if the words don't co-occur with one another.

Thus a lot of effort has been put into dimensionality reduction after computing the word co-occurrence matrix. IMHO, it's like taking a top-down view of the global relations between words and then compressing the matrix to get a smaller vector to represent each word.
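
A toy illustration of that traditional pipeline (the word-by-word co-occurrence counts below are made up; in practice the matrix is huge and sparse, and the reduction here uses a plain truncated SVD):

>>> import numpy as np
>>> words = ['apple', 'orange', 'fruit', 'samsung']
>>> # made-up counts: cooc[i][j] = how often words[i] appears near words[j]
>>> cooc = np.array([[0., 3., 5., 4.],
...                  [3., 0., 6., 0.],
...                  [5., 6., 0., 0.],
...                  [4., 0., 0., 0.]])
>>> U, S, Vt = np.linalg.svd(cooc)
>>> vectors = U[:, :2] * S[:2]    # keep the 2 strongest components: one short vector per word
>>> vectors.shape
(4, 2)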

So the "deep learning" word embedding creation comes from the another school of thought and starts with a randomly (sometimes not-so random) initialized a layer of vectors for each word and learning the parameters/weights for these vectors and optimizing these parameters/weights by minimizing some loss function based on some defined properties.

It sounds a little vague, but it becomes clearer if we look at the Word2Vec learning technique concretely; see:

  • https://rare-technologies.com/making-sense-of-word2vec/
  • http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
  • https://arxiv.org/pdf/1402.3722.pdf (more mathematical)

Here are more resources to read up on word embeddings: https://github.com/keon/awesome-nlp#word-vectors
