I want to understand what is meant by "dimensionality" in word embeddings.
When I embed a word in the form of a matrix for NLP tasks, what role does dimensionality play? Is there a visual example which can help me understand this concept?
If you're in a hurry, one rule of thumb is to use the fourth root of the total number of unique categorical elements, while another is that the embedding dimension should be approximately 1.6 times the square root of the number of unique elements in the category, and no more than 600.
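As a quick sketch (the function names and example vocabulary sizes below are my own, purely for illustration), the two heuristics are easy to compare in code:

def fourth_root_rule(n_unique):
    # rule of thumb 1: fourth root of the number of unique elements
    return round(n_unique ** 0.25)

def sqrt_rule(n_unique, cap=600):
    # rule of thumb 2: ~1.6 * sqrt(n), capped at 600
    return min(cap, round(1.6 * n_unique ** 0.5))

for n in (1_000, 50_000, 1_000_000):
    print(n, fourth_root_rule(n), sqrt_rule(n))

For a vocabulary of one million words the first rule suggests about 32 dimensions, while the second hits the 600 cap.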
"Word Vector Dimension" is the dimension of the vector that you have trained with the training document. Technically you can choose any dimension, like 10, 100, 300, even 1000. Industry norm is 300-500 because we have experimented with different dimensions (300, 400, 500, ... 1000, etc.)
Basically, an embedding is a mapping of the original input data into a set of real-valued dimensions, and the "position" of each input in those dimensions is learned so as to improve performance on the task.
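As a minimal sketch (all numbers below are made up), you can think of an embedding as a trainable table whose width is the dimensionality you choose:

import numpy as np

vocab_size = 10_000    # number of unique words (made-up)
embedding_dim = 300    # the dimensionality you choose as a hyperparameter

# A randomly initialised lookup table; training nudges these values so that
# each word's "position" in the 300 dimensions helps the downstream task.
embedding_matrix = np.random.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

word_index = 42                             # hypothetical index of some word
print(embedding_matrix[word_index].shape)   # (300,)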
A Word Embedding is just a mapping from words to vectors. Dimensionality in word embeddings refers to the length of these vectors.
These mappings come in different formats. Most pre-trained embeddings are available as a space-separated text file, where each line contains a word in the first position and its vector representation next to it. If you were to split these lines, you would find that they are of length 1 + dim, where dim is the dimensionality of the word vectors and 1 corresponds to the word being represented. See the GloVe pre-trained vectors for a real example.
For example, if you download glove.twitter.27B.zip, unzip it, and run the following Python code:
#!/usr/bin/python3
with open('glove.twitter.27B.50d.txt') as f:
    lines = f.readlines()
lines = [line.rstrip().split() for line in lines]
print(len(lines))           # number of words (aka vocabulary size)
print(len(lines[0]))        # length of a line
print(lines[130][0])        # word 130
print(lines[130][1:])       # vector representation of word 130
print(len(lines[130][1:]))  # dimensionality of word 130
you would get the output
1193514
51
people
['1.4653', '0.4827', ..., '-0.10117', '0.077996'] # shortened for illustration purposes
50
Somewhat unrelated, but equally important: the lines in these files are sorted by word frequency in the corpus on which the embeddings were trained (most frequent words first).
You could also represent these embeddings as a dictionary where the keys are the words and the values are lists representing word vectors. The length of these lists would be the dimensionality of your word vectors.
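For example, reusing the lines variable parsed in the snippet above, such a dictionary could be built like this (a sketch, not the only way to do it):

# word -> list of floats; assumes `lines` from the snippet above
embeddings = {word: [float(x) for x in vector] for word, *vector in lines}

print(len(embeddings['people']))   # 50, i.e. the dimensionality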
A more common practice is to represent them as matrices (also called lookup tables) of dimension (V x D), where V is the vocabulary size (i.e., how many words you have) and D is the dimensionality of each word vector. In this case you need to keep a separate dictionary mapping each word to its corresponding row in the matrix.
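Continuing the same sketch, the lookup-table representation could be built from the parsed lines as follows (the variable names are mine):

import numpy as np

# row index for each word, and a (V x D) matrix of vectors;
# assumes `lines` from the GloVe snippet above
word2idx = {word: i for i, (word, *_) in enumerate(lines)}
matrix = np.array([vector for _, *vector in lines], dtype=np.float32)

print(matrix.shape)                    # (1193514, 50), i.e. (V, D)
print(matrix[word2idx['people']][:3])  # first 3 components of "people"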
Regarding your question about the role dimensionality plays, a full answer needs some theoretical background. But in a few words, the space in which words are embedded has nice properties that allow NLP systems to perform better. One of these properties is that words with similar meanings are spatially close to each other, that is, they have similar vector representations, as measured by Euclidean distance or cosine similarity.
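For instance, reusing matrix and word2idx from the sketch above, cosine similarity between word vectors can be computed directly (assuming the chosen words are in the vocabulary):

import numpy as np

def cosine_similarity(u, v):
    # 1.0 = same direction (very similar), values near 0 = unrelated
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

roads, road, banana = (matrix[word2idx[w]] for w in ('roads', 'road', 'banana'))

print(cosine_similarity(roads, road))    # expected to be relatively high
print(cosine_similarity(roads, banana))  # expected to be lower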
You can visualize a 3D projection of several word embeddings here, and see, for example, that the closest words to "roads" are "highways", "road", and "routes" in the Word2Vec 10K embedding.
For a more detailed explanation I recommend reading the section "Word Embeddings" of this post by Christopher Olah.
For more theory on why using word embeddings, which are an instance of distributed representations, is better than using, for example, one-hot encodings (local representations), I recommend reading the first sections of Distributed Representations by Geoffrey Hinton et al.
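To make that contrast concrete, here is a toy comparison (with made-up sizes) of a local one-hot representation and a distributed dense representation of the same word:

import numpy as np

vocab_size = 10_000   # made-up vocabulary size

# Local (one-hot) representation: one dimension per word, a single 1,
# and every pair of distinct words is equally far apart.
one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0

# Distributed representation: a short dense vector whose learned values
# let words with similar meanings end up close to each other.
dense = np.random.uniform(-0.05, 0.05, size=50)

print(one_hot.shape, int(one_hot.sum()))   # (10000,) 1
print(dense.shape)                         # (50,)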
Textual data has to be converted into numeric data before it can be fed into a machine learning algorithm. Word embedding is one approach to this, in which each word is mapped to a vector.
In algebra, a vector is a point in space with magnitude and direction. In simpler terms, a vector is a one-dimensional array (or a matrix with a single column), and dimensionality is the number of elements in that array.
Pre-trained word embedding models like GloVe and Word2vec provide vectors in several dimensionalities, for instance 50, 100, 200, or 300. Each word is a point in D-dimensional space, and words with similar meanings are points close to each other. Higher dimensionality generally gives better accuracy, but the computational cost is also higher.
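As an illustration (assuming gensim is installed and can download the pre-trained model named below from its data catalogue), you can inspect the dimensionality and nearest neighbours like this:

import gensim.downloader as api

# Downloads 100-dimensional GloVe vectors on first use.
model = api.load('glove-wiki-gigaword-100')

print(model['king'].shape)                   # (100,) -> the dimensionality
print(model.most_similar('roads', topn=3))   # nearby points in the embedding space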