Document similarity: Vector embedding versus Tf-Idf performance?

People also ask

Why is TF-IDF better than word embeddings?

There are a couple of reasons why TF-IDF was superior: the word-embedding method used only the first 20 words of each document, while the TF-IDF method used all available words. The TF-IDF method therefore gained more information from longer documents than the embedding method did.

Which is better, TF-IDF or Word2Vec?

A key difference between TF-IDF and word2vec is that TF-IDF is a statistical measure applied to the terms in a document, which can then be used to form a vector, whereas word2vec produces a vector per term, and more work may be needed to combine that set of vectors into a single document vector or another representation.

Why is TF-IDF better than Word2Vec?

The TF-IDF model performs better than the Word2vec model there because the data in each emotion class is imbalanced and several classes have only a small number of samples. The "surprised" emotion is a minority class, with far fewer examples than the other emotions.

Which similarity metric is usually considered suitable for measuring the similarity between TF-IDF vectors?

Cosine similarity is very popular in text analysis. It is used to determine how similar documents are to one another, irrespective of their size. The TF-IDF technique helps convert the documents into vectors, where each value in the vector corresponds to the TF-IDF score of a word in the document.
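For illustration, here is a minimal sketch of that pipeline with scikit-learn, using made-up example documents: TF-IDF vectors compared pairwise with cosine similarity.

```python
# Minimal sketch: TF-IDF document vectors compared with cosine similarity.
# The example documents are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat lay on a rug",
    "stock prices fell sharply on Monday",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # one sparse TF-IDF row per document

# Pairwise cosine similarities; vector length is normalized away,
# so documents of different sizes can still be compared.
print(cosine_similarity(tfidf).round(2))
```

Here the first two documents should score higher with each other than either does with the third, since they share content words.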


I have a collection of documents, each of which grows rapidly over time. The task is to find similar documents at any fixed point in time. I have two potential approaches:

  1. A vector embedding (word2vec, GloVe, or fastText): average the word vectors in each document and compare documents with cosine similarity (a sketch of this follows the list).

  2. Bag-of-words: tf-idf or one of its variants, such as BM25.
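For approach 1, a minimal sketch, assuming a small pretrained GloVe model fetched through gensim's downloader (the tf-idf side of approach 2 is sketched above):

```python
# Minimal sketch of approach 1: average pretrained word vectors per
# document, then compare documents with cosine similarity.
import numpy as np
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-50")  # small pretrained GloVe vectors

def doc_vector(text: str) -> np.ndarray:
    """Mean of the vectors of in-vocabulary tokens (crude but common pooling)."""
    vecs = [kv[t] for t in text.lower().split() if t in kv]
    if not vecs:
        return np.zeros(kv.vector_size)
    return np.mean(vecs, axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

print(cosine(doc_vector("the cat sat on the mat"),
             doc_vector("a cat lay on a rug")))
```

Note that averaging discards word order entirely: two documents containing the same words in a different order get identical vectors.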

Will one of these yield significantly better results? Has anyone done a quantitative comparison of tf-idf versus averaged word2vec vectors for document similarity?

Is there another approach that allows the document vectors to be refined dynamically as more text is added?
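On the last point: with averaged word vectors, the document vector can be refined incrementally by keeping a running sum and token count per document, so appended text never forces reprocessing of the old text. A hypothetical sketch (the class and its names are mine, not from any library):

```python
# Hypothetical sketch: incrementally refine an averaged-embedding document
# vector as new text arrives, by keeping a running sum and token count.
import numpy as np

class IncrementalDocVector:
    def __init__(self, kv):
        self.kv = kv                        # any gensim-style KeyedVectors
        self.total = np.zeros(kv.vector_size)
        self.count = 0

    def add_text(self, text: str) -> None:
        """Fold newly appended text into the running sum."""
        for token in text.lower().split():
            if token in self.kv:
                self.total += self.kv[token]
                self.count += 1

    @property
    def vector(self) -> np.ndarray:
        """Current mean vector over all tokens seen so far."""
        if self.count == 0:
            return np.zeros(self.kv.vector_size)
        return self.total / self.count
```

TF-IDF is harder to update this way: per-document term frequencies are easy to maintain, but the IDF term depends on the whole collection, so every document's weights shift as the corpus grows.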