
Use tf-idf with FastText vectors

I'm interested in using tf-idf with the FastText library, but haven't found a logical way to handle the n-grams. I have already used tf-idf with SpaCy vectors, for which I found several examples like these:

  • http://dsgeek.com/2018/02/19/tfidf_vectors.html

  • https://www.aclweb.org/anthology/P16-1089

  • http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

But for the FastText library it is not that clear to me, since it has a granularity that isn't as intuitive, e.g.:

For a general word2vec approach I will have one vector for each word, so I can count the term frequency of each word and weight its vector accordingly.
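For instance, a minimal sketch of that word-level weighting (my own illustration, not taken from the linked posts), assuming spaCy vectors and idf weights from scikit-learn's TfidfVectorizer, could look like this:

# Hypothetical sketch: idf-weighted average of per-word spaCy vectors.
import numpy as np
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load("en_core_web_md")   # any spaCy model that ships word vectors
docs = ["Listen to the latest news summary", "Read the latest news article"]

tfidf = TfidfVectorizer()
tfidf.fit(docs)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def doc_vector(text):
    tokens = [t for t in nlp(text) if t.has_vector]
    # Each occurrence contributes once (the tf part); idf scales its weight.
    weights = np.array([idf.get(t.text.lower(), 1.0) for t in tokens])
    vectors = np.array([t.vector for t in tokens])
    return (vectors * weights[:, None]).sum(axis=0) / weights.sum()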

But in fastText the same word will be split into several n-grams;

"Listen to the latest news summary" will have n-grams generated by a sliding windows like:

lis ist ste ten tot het...

These n-grams are handled internally by the model so when I try:

model["Listen to the latest news summary"] 

I get the final vector directly. Hence, what I have thought of is to split the text into n-grams before feeding the model, like:

model['lis']
model['ist']
model['ten']

And build the tf-idf from there, but that seems like an inefficient approach. Is there a standard way to apply tf-idf to n-gram vectors like these?
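For context, here is a small sketch (my own, and it assumes gensim's FastText implementation rather than the official fasttext package) showing that the character n-grams are handled internally, so even an out-of-vocabulary word gets a vector without any manual n-gram splitting:

from gensim.models import FastText

sentences = [["listen", "to", "the", "latest", "news", "summary"],
             ["read", "the", "latest", "news", "article"]]
model = FastText(sentences=sentences, vector_size=100, min_count=1, epochs=10)

print(model.wv["listen"].shape)               # in-vocabulary word
print(model.wv["listenings"].shape)           # OOV word, composed from its char n-grams
print("listenings" in model.wv.key_to_index)  # False: never seen during training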

asked Sep 23 '19 by Luis Ramon Ramirez Rodriguez



1 Answer

I would let FastText deal with the trigrams, but keep building the tf-idf-weighted embeddings at the word level.

That is, you send

model["Listen"]
model["to"]
model["the"]
...

to FastText, and then use your old code to get the tf-idf weights.
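A rough sketch of what that could look like, assuming a gensim FastText model and idf weights from scikit-learn's TfidfVectorizer (the doc_vector helper is illustrative, not part of either library):

# Illustrative only: the same tf-idf weighting as before, but each word vector
# now comes from FastText, which assembles it from character n-grams internally.
import numpy as np
from gensim.models import FastText
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["listen to the latest news summary", "read the latest news article"]
tokenized = [doc.split() for doc in corpus]

ft = FastText(sentences=tokenized, vector_size=100, min_count=1, epochs=10)

tfidf = TfidfVectorizer()
tfidf.fit(corpus)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def doc_vector(words):
    # idf-weighted average of per-word FastText vectors; unseen terms default to 1.0
    weights = np.array([idf.get(w, 1.0) for w in words])
    vectors = np.array([ft.wv[w] for w in words])
    return (vectors * weights[:, None]).sum(axis=0) / weights.sum()

print(doc_vector(tokenized[0]).shape)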

In any case, it would be good to know whether FastText itself respects word boundaries when processing a sentence, or whether it truly treats it as one continuous sequence of trigrams (blending consecutive words). If the latter is true, then for FastText you would lose information by breaking the sentence into separate words.

answered Sep 27 '22 by HerrIvan