Better text documents clustering than tf/idf and cosine similarity?

I'm trying to cluster the Twitter stream, putting each tweet into a cluster of tweets that talk about the same topic. I tried an online clustering algorithm with tf/idf and cosine similarity, but the results were quite bad.

The main disadvantage of tf/idf is that it clusters documents that share keywords, so it is only good at identifying near-identical documents. For example, consider the following sentences:

1- The website Stackoverflow is a nice place.
2- Stackoverflow is a website.

The previous two sentences will likely be clustered together with a reasonable threshold value, since they share a lot of keywords. But now consider the following two sentences:

1- The website Stackoverflow is a nice place.
2- I visit Stackoverflow regularly.

Now, using tf/idf, the clustering algorithm will fail miserably because the sentences share only one keyword, even though they both talk about the same topic.
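To make the problem concrete, here is a minimal sketch of tf/idf plus cosine similarity on the three example sentences, using only the standard library. The tokenizer, the tiny stopword list, and the smoothed idf formula are assumptions chosen for illustration, not the exact setup used on the stream.

```python
import math
from collections import Counter

# Hypothetical minimal stopword list, just enough for these sentences.
STOPWORDS = {"the", "is", "a", "i"}


def tokenize(text):
    """Lowercase, split on whitespace, strip basic punctuation, drop stopwords."""
    tokens = (w.strip(".,!?").lower() for w in text.split())
    return [t for t in tokens if t and t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed idf (one common variant, assumed here).
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
]
vecs = tfidf_vectors(docs)
sim_12 = cosine(vecs[0], vecs[1])  # sentences sharing two keywords
sim_13 = cosine(vecs[0], vecs[2])  # sentences sharing only "stackoverflow"
print(f"sim(1, 2) = {sim_12:.3f}, sim(1, 3) = {sim_13:.3f}")
```

The second pair scores much lower than the first, so any threshold that groups the first pair tends to split the second, even though both pairs are about the same topic.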

My question: are there better techniques to cluster documents?

asked Jul 08 '13 23:07

Jack Twain

2 Answers

In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't contain enough common terms.
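A rough sketch of the idea, assuming NumPy and a hypothetical five-document toy corpus: documents 0 and 1 share no terms at all, but document 2 links their vocabularies. Raw counts stand in for tf-idf weights to keep the example short, and `k = 2` is an arbitrary choice for this corpus.

```python
import numpy as np

# Hypothetical toy corpus, already lowercased and stripped of stopwords.
docs = [
    "website nice place",             # 0
    "visit stackoverflow regularly",  # 1: no term in common with doc 0
    "stackoverflow website",          # 2: links the vocabularies of 0 and 1
    "apple fruit delicious",          # 3: unrelated topic
    "apple eating fruit",             # 4: unrelated topic
]

# Build a documents x terms count matrix.
vocab = sorted({t for d in docs for t in d.split()})
index = {t: j for j, t in enumerate(vocab)}
X = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for t in d.split():
        X[i, index[t]] += 1.0


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# LSA: truncated SVD of the document-term matrix.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                        # number of latent dimensions kept
doc_vecs = U[:, :k] * S[:k]  # each row is a document in latent space

raw_01 = cosine(X[0], X[1])                # similarity on raw term counts
lsa_01 = cosine(doc_vecs[0], doc_vecs[1])  # same pair in latent space
lsa_03 = cosine(doc_vecs[0], doc_vecs[3])  # pair from different topics
print(raw_01, lsa_01, lsa_03)
```

On raw counts, documents 0 and 1 look unrelated because they share no terms. In the latent space they end up close together, while the pair from different topics stays apart. On real tweets the effect is less clean, and the number of dimensions needs tuning.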

Topic models such as LDA might work even better.

answered Sep 19 '22 19:09

Fred Foo

As mentioned in other comments and answers, LDA can give good tweet->topic weights.

If these weights alone are not a sufficient clustering for your needs, you could cluster the topic distributions themselves using a clustering algorithm.
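One possible way to wire this up, assuming scikit-learn is available: fit LDA on a bag-of-words matrix, then run k-means on the resulting per-tweet topic distributions. The tweets below are hypothetical, and the choice of two topics and two clusters is arbitrary. With so little data the learned topics are not meaningful, so this only shows the shape of the pipeline.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy tweets; a real topic model needs far more data.
tweets = [
    "Steve Jobs was the CEO at Apple",
    "Apple released a new phone and laptop",
    "The CEO announced a new Apple laptop",
    "I'm eating the most delicious apple",
    "Apple pie with fresh fruit is delicious",
    "Eating fresh fruit every day",
]

# LDA works on raw term counts, not tf-idf weights.
counts = CountVectorizer(stop_words="english").fit_transform(tweets)

# Each row of doc_topics is one tweet's distribution over the topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Cluster the topic distributions rather than the sparse word vectors.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_topics)

for tweet, dist, label in zip(tweets, doc_topics, labels):
    print(label, dist.round(2), tweet)
```

A simpler alternative is to skip k-means and assign each tweet to its highest-weight topic with `doc_topics.argmax(axis=1)`.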

While it depends on the training set, LDA could easily bundle tweets containing stackoverflow, stack-overflow and stack overflow into the same topic. However, "my stack of boxes is about to overflow" might instead go into another topic about boxes.

Another example: A tweet with the word Apple could go into a number of different topics (the company, the fruit, New York and others). LDA would look at the other words in the tweet to determine the applicable topics.

"Steve Jobs was the CEO at Apple" is clearly about the company
"I'm eating the most delicious apple" is clearly about the fruit
"I'm going to the big apple when I travel to the USA" is most likely about visiting New York

answered Sep 17 '22 19:09

ilikedata

Tags: machine-learning, cluster-analysis, text-mining, data-mining
