I want to understand what is meant by "dimensionality" in word embeddings.
When I embed a word in the form of a matrix for NLP tasks, what role does dimensionality play? Is there a visual example which can help me understand this concept?
If you're in a hurry, one rule of thumb is to use the fourth root of the total number of unique categorical elements, while another is that the embedding dimension should be approximately 1.6 times the square root of the number of unique elements in the category, and no more than 600.
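As a quick sketch (the function names and example vocabulary sizes below are my own, purely for illustration), the two heuristics are easy to compare in code:

def fourth_root_rule(n_unique):
    # rule of thumb 1: fourth root of the number of unique elements
    return round(n_unique ** 0.25)

def sqrt_rule(n_unique, cap=600):
    # rule of thumb 2: ~1.6 * sqrt(n), capped at 600
    return min(cap, round(1.6 * n_unique ** 0.5))

for n in (1_000, 50_000, 1_000_000):
    print(n, fourth_root_rule(n), sqrt_rule(n))

For a vocabulary of one million words the first rule suggests about 32 dimensions, while the second hits the 600 cap.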
"Word Vector Dimension" is the dimension of the vector that you have trained with the training document. Technically you can choose any dimension, like 10, 100, 300, even 1000. Industry norm is 300-500 because we have experimented with different dimensions (300, 400, 500, ... 1000, etc.)
Basically, an embedding is a mapping of the original input data into a set of real-valued dimensions, and the "position" of each input in those dimensions is learned so as to improve performance on the task.
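As a minimal sketch (all numbers below are made up), you can think of an embedding as a trainable table whose width is the dimensionality you choose:

import numpy as np

vocab_size = 10_000    # number of unique words (made-up)
embedding_dim = 300    # the dimensionality you choose as a hyperparameter

# A randomly initialised lookup table; training nudges these values so that
# each word's "position" in the 300 dimensions helps the downstream task.
embedding_matrix = np.random.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

word_index = 42                             # hypothetical index of some word
print(embedding_matrix[word_index].shape)   # (300,)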
A Word Embedding is just a mapping from words to vectors. Dimensionality in word embeddings refers to the length of these vectors.
These mappings come in different formats. Most pre-trained embeddings are available as a space-separated text file, where each line contains a word in the first position and its vector representation next to it. If you were to split these lines, you would find that they are of length 1 + dim, where dim is the dimensionality of the word vectors and 1 corresponds to the word being represented. See the GloVe pre-trained vectors for a real example.
For example, if you download glove.twitter.27B.zip, unzip it, and run the following Python code:
#!/usr/bin/python3
with open('glove.twitter.27B.50d.txt') as f:
    lines = f.readlines()
lines = [line.rstrip().split() for line in lines]
print(len(lines))           # number of words (aka vocabulary size)
print(len(lines[0]))        # length of a line
print(lines[130][0])        # word 130
print(lines[130][1:])       # vector representation of word 130
print(len(lines[130][1:]))  # dimensionality of word 130
you would get the output
1193514
51
people
['1.4653', '0.4827', ..., '-0.10117', '0.077996'] # shortened for illustration purposes
50
Somewhat unrelated, but equally important: the lines in these files are sorted by word frequency in the corpus on which the embeddings were trained (most frequent words first).
You could also represent these embeddings as a dictionary where the keys are the words and the values are lists representing word vectors. The length of these lists would be the dimensionality of your word vectors.
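For example, reusing the lines variable parsed in the snippet above, such a dictionary could be built like this (a sketch, not the only way to do it):

# word -> list of floats; assumes `lines` from the snippet above
embeddings = {word: [float(x) for x in vector] for word, *vector in lines}

print(len(embeddings['people']))   # 50, i.e. the dimensionality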
A more common practice is to represent them as matrices (also called lookup tables) of dimension (V x D), where V is the vocabulary size (i.e., how many words you have) and D is the dimensionality of each word vector. In this case you need to keep a separate dictionary mapping each word to its corresponding row in the matrix.
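Continuing the same sketch, the lookup-table representation could be built from the parsed lines as follows (the variable names are mine):

import numpy as np

# row index for each word, and a (V x D) matrix of vectors;
# assumes `lines` from the GloVe snippet above
word2idx = {word: i for i, (word, *_) in enumerate(lines)}
matrix = np.array([vector for _, *vector in lines], dtype=np.float32)

print(matrix.shape)                    # (1193514, 50), i.e. (V, D)
print(matrix[word2idx['people']][:3])  # first 3 components of "people"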
Regarding your question about the role dimensionality plays, a full answer needs some theoretical background. But in a few words, the space in which words are embedded has nice properties that allow NLP systems to perform better. One of these properties is that words with similar meanings are spatially close to each other, that is, they have similar vector representations, as measured by Euclidean distance or cosine similarity.
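For instance, reusing matrix and word2idx from the sketch above, cosine similarity between word vectors can be computed directly (assuming the chosen words are in the vocabulary):

import numpy as np

def cosine_similarity(u, v):
    # 1.0 = same direction (very similar), values near 0 = unrelated
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

roads, road, banana = (matrix[word2idx[w]] for w in ('roads', 'road', 'banana'))

print(cosine_similarity(roads, road))    # expected to be relatively high
print(cosine_similarity(roads, banana))  # expected to be lower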
You can visualize a 3D projection of several word embeddings here, and see, for example, that the closest words to "roads" are "highways", "road", and "routes" in the Word2Vec 10K embedding.
For a more detailed explanation I recommend reading the section "Word Embeddings" of this post by Christopher Olah.
For more theory on why using word embeddings, which are an instance of distributed representations, is better than using, for example, one-hot encodings (local representations), I recommend reading the first sections of Distributed Representations by Geoffrey Hinton et al.
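To make that contrast concrete, here is a toy comparison (with made-up sizes) of a local one-hot representation and a distributed dense representation of the same word:

import numpy as np

vocab_size = 10_000   # made-up vocabulary size

# Local (one-hot) representation: one dimension per word, a single 1,
# and every pair of distinct words is equally far apart.
one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0

# Distributed representation: a short dense vector whose learned values
# let words with similar meanings end up close to each other.
dense = np.random.uniform(-0.05, 0.05, size=50)

print(one_hot.shape, int(one_hot.sum()))   # (10000,) 1
print(dense.shape)                         # (50,)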
Textual data has to be converted into numeric data before it can be fed into a machine learning algorithm. Word embedding is one approach to this, in which each word is mapped to a vector.
In algebra, a vector is a point in space with magnitude and direction. In simpler terms, a vector is a one-dimensional array (or a matrix with a single column), and dimensionality is the number of elements in that array.
Pre-trained word embedding models like GloVe and Word2vec provide vectors in several dimensionalities, for instance 50, 100, 200, or 300. Each word is a point in D-dimensional space, and words with similar meanings are points close to each other. Higher dimensionality generally gives better accuracy, but the computational cost is also higher.
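As an illustration (assuming gensim is installed and can download the pre-trained model named below from its data catalogue), you can inspect the dimensionality and nearest neighbours like this:

import gensim.downloader as api

# Downloads 100-dimensional GloVe vectors on first use.
model = api.load('glove-wiki-gigaword-100')

print(model['king'].shape)                   # (100,) -> the dimensionality
print(model.most_similar('roads', topn=3))   # nearby points in the embedding space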