 

Better text-document clustering than tf/idf and cosine similarity?

I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results are quite bad.

The main disadvantage of using tf/idf is that it clusters documents that are keyword-similar, so it's only good for identifying near-identical documents. For example, consider the following sentences:

1- The website Stackoverflow is a nice place.
2- Stackoverflow is a website.

The previous two sentences will likely be clustered together with a reasonable threshold value, since they share a lot of keywords. But now consider the following two sentences:

1- The website Stackoverflow is a nice place.
2- I visit Stackoverflow regularly.

Now a clustering algorithm using tf/idf will fail miserably, because the two sentences share only one keyword even though they both talk about the same topic.
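The failure mode above can be reproduced in a few lines. This is a minimal, self-contained sketch of tf/idf with smoothed idf and cosine similarity (a simplified reimplementation for illustration, not any particular library's exact weighting scheme):

```python
import math
from collections import Counter

def tokenize(text):
    # Naive tokenizer: lowercase and strip trailing punctuation.
    return [w.lower().strip(".,!?") for w in text.split()]

def tfidf_vectors(docs):
    """Build tf-idf vectors over a shared vocabulary (smoothed idf)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]))  # keyword-similar pair: relatively high score
print(cosine(v[0], v[2]))  # same topic, one shared keyword: much lower score
```

The second pair scores far lower purely because of vocabulary overlap, even though a human would put both sentences in the same topic.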

My question: are there better techniques for clustering documents?

Jack Twain asked Jul 08 '13 23:07




2 Answers

In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't contain enough common terms.
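The LSA approach the answer describes can be sketched with a truncated SVD of the term-document matrix. This is an illustrative toy example, assuming NumPy and a hand-made count matrix; in practice you would build a tf-idf matrix over a large corpus and use a library such as scikit-learn or gensim:

```python
import numpy as np

# Toy term-document matrix: rows = documents, columns = terms.
# (Illustrative data, not a real corpus.)
X = np.array([
    [1, 1, 1, 0, 0],   # doc about websites
    [1, 1, 0, 0, 0],   # another website doc
    [0, 1, 0, 1, 1],   # related topic with little term overlap
], dtype=float)

# LSA = truncated SVD: keep only the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_vecs = U[:, :k] * s[:k]   # documents projected into latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(doc_vecs[0], doc_vecs[2]))  # similarity in the reduced latent space
```

Because similar terms get folded into shared latent dimensions, documents can score as similar even when their raw term vectors barely overlap, which is exactly the sparsity problem in the question.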

Topic models such as LDA might work even better.

Fred Foo answered Sep 19 '22 19:09


As mentioned in other comments and answers, using LDA can give good tweet-to-topic weights.

If these weights alone don't cluster well enough for your needs, you could cluster the topic distributions themselves using a clustering algorithm.

While it is training-set dependent, LDA could easily bundle tweets containing stackoverflow, stack-overflow and stack overflow into the same topic. However, "my stack of boxes is about to overflow" might instead go into another topic about boxes.

Another example: a tweet containing the word Apple could fall into a number of different topics (the company, the fruit, New York, and others). LDA would look at the other words in the tweet to determine the applicable topics.

  1. "Steve Jobs was the CEO at Apple" is clearly about the company
  2. "I'm eating the most delicious apple" is clearly about the fruit
  3. "I'm going to the big apple when I travel to the USA" is most likely about visiting New York
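The pipeline this answer suggests (LDA topic weights, then clustering the topic distributions) can be sketched with scikit-learn. This is a hedged sketch, not a tuned solution: the corpus, the topic count `n_components=3`, and the cluster count are illustrative choices, and real Twitter data would need far more documents and preprocessing:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

# Illustrative mini-corpus echoing the examples above.
tweets = [
    "Steve Jobs was the CEO at Apple",
    "I'm eating the most delicious apple",
    "I'm going to the big apple when I travel to the USA",
    "Stackoverflow is a nice website",
]

# LDA works on raw term counts, not tf-idf.
counts = CountVectorizer(stop_words="english").fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-tweet topic weight distributions

# If the raw weights aren't enough, cluster the topic distributions.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(doc_topics)
print(labels)
```

Each row of `doc_topics` is a probability distribution over topics for one tweet, so tweets about the same subject end up close together even when they share few exact keywords.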
ilikedata answered Sep 17 '22 19:09