
Representation and a good similarity measure between Tweets for topic detection

I'm planning to write a tool for topic detection on Twitter. I've been thinking about a good similarity measure (distance) between two tweets, and about how to represent them, taking into account:

  • The #hashtags (I think hashtags are very important when detecting topics on Twitter)
  • The replies (if someone replies to a tweet, both tweets are probably about the same topic, although two people could start out talking about the Samsung Galaxy and end up talking about iPhone jailbreaking, etc.)

I'm thinking about implementing what I have so far and running some experiments. I'll implement the classic vector space models (such as TF*IDF with Euclidean distance, cosine similarity, etc.) and the boolean models with a few similarity measures (Hamming, Jaccard, etc.).
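As a starting point for those experiments, here is a minimal sketch of two of the measures mentioned above: TF*IDF vectors compared with cosine similarity, and a boolean (set-based) Jaccard similarity over hashtags. The function names, the tokenizing regex, and the smoothed IDF formula are my own assumptions for illustration, not a reference implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase tokens; "#hashtags" and "@mentions" are kept as single tokens.
    return re.findall(r"[#@]?\w+", text.lower())


def hashtags(text):
    # Boolean representation: the set of hashtags appearing in the tweet.
    return {tok for tok in tokenize(text) if tok.startswith("#")}


def tfidf_vectors(docs):
    # Sparse TF*IDF vectors (dict: term -> weight) for a list of tweets.
    # Assumed smoothing: idf = log((1 + N) / (1 + df)) + 1.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(a, b):
    # Jaccard similarity between two sets; defined as 0.0 when both are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For example, after `vecs = tfidf_vectors(tweets)`, `cosine(vecs[i], vecs[j])` compares the text of two tweets, while `jaccard(hashtags(tweets[i]), hashtags(tweets[j]))` compares only their hashtags. Tweets are short, so the TF*IDF vectors are very sparse, which is one reason to treat hashtags as a separate signal.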

Any ideas on how to adapt an existing model to Twitter, or on how to create a new one?

Oscar Mederos asked Feb 06 '13 10:02



1 Answer

The post "Similarity Metrics on Twitter" discusses the different similarity measures that you can use for clustering Twitter data. We did some research on clustering Twitter users based on user connections, user mentions, geo-location, content similarity between tweets, content similarity between user descriptions, and common #hashtags.
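One simple way to combine several of these signals into a single tweet-to-tweet score is a weighted average of per-signal similarities. The sketch below is only an illustration under my own assumptions: the `Tweet` class, the `combined_similarity` function, and the weights are hypothetical, and it covers content, hashtags, mentions, and reply links only (geo-location and user connections are left out).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tweet:
    # Hypothetical minimal representation of a tweet.
    tweet_id: int
    author: str
    words: frozenset = frozenset()
    hashtags: frozenset = frozenset()
    mentions: frozenset = frozenset()
    reply_to: Optional[int] = None  # id of the tweet being replied to, if any


def jaccard(a, b):
    # Jaccard similarity between two sets; 0.0 when both are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Assumed weights, chosen arbitrarily; they would need tuning on real data.
DEFAULT_WEIGHTS = {"content": 0.4, "hashtags": 0.3, "mentions": 0.1, "reply": 0.2}


def combined_similarity(t1, t2, weights=DEFAULT_WEIGHTS):
    # Weighted average of per-signal similarities, each in [0, 1].
    is_reply = t1.reply_to == t2.tweet_id or t2.reply_to == t1.tweet_id
    scores = {
        "content": jaccard(t1.words, t2.words),
        "hashtags": jaccard(t1.hashtags, t2.hashtags),
        "mentions": jaccard(t1.mentions, t2.mentions),
        "reply": 1.0 if is_reply else 0.0,
    }
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total
```

Because the result stays in [0, 1], `1 - combined_similarity(a, b)` can be used as a dissimilarity score for clustering, although it is not guaranteed to be a true metric. Keeping the per-signal scores separate also makes it easy to test which signals actually help.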

For finding common topics on Twitter, looking at the connections between the users discussing a topic really helps: we found that groups of users tend to discuss a common topic. The second half of the post covers this in some detail.

Pulkit Goyal answered Nov 08 '22 20:11
