 

What does a weighted word embedding mean?

In the paper that I am trying to implement, it says,

In this work, tweets were modeled using three types of text representation. The first one is a bag-of-words model weighted by tf-idf (term frequency - inverse document frequency) (Section 2.1.1). The second represents a sentence by averaging the word embeddings of all words (in the sentence) and the third represents a sentence by averaging the weighted word embeddings of all words, the weight of a word is given by tf-idf (Section 2.1.2).

I am not sure about the third representation, the weighted word embeddings, where the weight of a word is given by tf-idf. I am not even sure whether word embeddings and tf-idf can be used together.

Dawn17 asked Dec 09 '17


People also ask

What is the purpose of word embedding?

A word embedding is a learned representation for text where words that have the same meaning have a similar representation. It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

What is TF-IDF weighted Word2Vec?

In TF-IDF weighted Word2Vec, we first calculate the tf-idf value of each word, then multiply each word's embedding by its tf-idf value, sum these weighted vectors, and divide the sum by the sum of the tf-idf values.
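As a rough illustration (notation mine, not from the sources above): for a sentence with words w_1 ... w_n whose embeddings are v_1 ... v_n, the weighted representation is (sum_i tfidf(w_i) * v_i) / (sum_i tfidf(w_i)), whereas the plain averaged representation is (sum_i v_i) / n.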

What is an example of a word embedding?

By using word embeddings, words that are close in meaning are grouped near one another in vector space. For example, when representing a word such as frog, its nearest neighbours would be frogs, toads, and Litoria.

What is the difference between word embedding and Word2Vec?

Even though Word2Vec is an unsupervised model where you can give a corpus without any label information and the model can create dense word embeddings, Word2Vec internally leverages a supervised classification model to get these embeddings from the corpus.


2 Answers

Averaging (possibly weighted) of word embeddings makes sense, though depending on the main algorithm and the training data this sentence representation may not be the best. The intuition is the following:

  • You might want to handle sentences of different length, hence the averaging (better than plain sum).
  • Some words in a sentence are usually much more valuable than others. TF-IDF is the simplest measure of a word's value. Note that, because the weighted average divides by the sum of the weights, the scale of the result doesn't change.

See also this paper by Kenter et al. There is a nice post that compares these two approaches across different algorithms and concludes that neither is significantly better than the other: some algorithms favor simple averaging, some perform better with TF-IDF weighting.
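For concreteness, here is a minimal numpy sketch of both representations (the plain average and the tf-idf weighted average). It assumes `embeddings` is a dict mapping words to vectors (e.g. from a trained Word2Vec model) and `idf` a dict of inverse document frequencies; both names are illustrative, not from the paper.

```python
import numpy as np

def sentence_vector(tokens, embeddings, idf=None):
    """Plain average of word vectors, or a tf-idf weighted average when `idf` is given."""
    counts = {w: tokens.count(w) for w in tokens}    # term frequency within the sentence
    vectors, weights = [], []
    for word in tokens:
        if word not in embeddings:
            continue                                 # skip out-of-vocabulary words
        vectors.append(embeddings[word])
        weights.append(counts[word] * idf.get(word, 1.0) if idf else 1.0)
    if not vectors:
        return None                                  # no known words in this sentence
    # np.average divides by the sum of the weights, so the scale stays comparable
    return np.average(np.vstack(vectors), axis=0, weights=np.array(weights, dtype=float))
```

Calling `sentence_vector(tokens, embeddings)` gives the unweighted average, and passing an `idf` table gives the tf-idf weighted variant.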

Maxim answered Sep 20 '22


In this article or this one, we use weighted sums: idf weighting, part-of-speech weighting, and a mixed method that uses both. The mixed method is the best and helped us rank first in the SemEval 2017 similarity task for English-Spanish and for Arabic-Arabic (actually we were officially second for Arabic because, for some reasons, we did not submit the mixed method).

It is very easy to implement and to use; the full formula is in the article, but in a nutshell, the vector of a sentence is simply V = sum_{i=1}^{k} PosWeight(w_i) * IDFWeight(w_i) * V_i, where V_i is the word embedding of w_i.
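To make the formula concrete, here is a minimal sketch, assuming `pos_weight` (one weight per POS tag), `idf_weight` (one weight per word) and `embeddings` are precomputed lookup tables; the names are illustrative, not from the article.

```python
import numpy as np

def mixed_sentence_vector(tagged_tokens, embeddings, pos_weight, idf_weight):
    """V = sum_{i=1}^{k} PosWeight(w_i) * IDFWeight(w_i) * V_i"""
    dim = len(next(iter(embeddings.values())))        # dimensionality of the word vectors
    v = np.zeros(dim)
    for word, tag in tagged_tokens:                   # tagged_tokens: list of (word, POS tag) pairs
        if word not in embeddings:
            continue                                  # skip out-of-vocabulary words
        v += pos_weight.get(tag, 1.0) * idf_weight.get(word, 1.0) * np.asarray(embeddings[word])
    return v
```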

Didier Schwab answered Sep 20 '22