I'm planning to write a tool for topic detection on Twitter. I've been thinking about a good similarity measure (distance) between two tweets, and about how to represent them, taking into account:
#hashtags
(I think hashtags are very important when detecting topics on Twitter.)
I'm thinking about implementing what I have so far and running some experiments. I'll implement the classic models (such as TF*IDF with Euclidean distance, cosine similarity, etc.) and the boolean models with a few similarity measures (Hamming, Jaccard, etc.).
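To make the baselines concrete, here is a minimal sketch in plain Python of what those experiments could look like. The function names (`tokenize`, `tfidf_vectors`, and so on), the tokenizer regex, and the smoothed IDF formula are assumptions chosen for illustration, not a fixed design; hashtags and mentions are simply kept as single tokens.

```python
import math
import re
from collections import Counter


def tokenize(tweet):
    # Lowercase; keep words, #hashtags and @mentions as single tokens.
    return re.findall(r"[#@]?\w+", tweet.lower())


def tfidf_vectors(tweets):
    # One sparse vector (dict term -> weight) per tweet.
    docs = [Counter(tokenize(t)) for t in tweets]
    n = len(docs)
    # Document frequency: in how many tweets each term appears.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption; other variants are equally valid).
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def euclidean(u, v):
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


def jaccard(a, b):
    # Boolean model: shared terms over all terms.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hamming(a, b):
    # Boolean model: number of terms present in exactly one tweet.
    return len(set(a) ^ set(b))
```

For example, `cosine(*tfidf_vectors([t1, t2]))` compares two tweets under the TF*IDF model, while `jaccard(tokenize(t1), tokenize(t2))` compares them under the boolean model. Note that IDF is only meaningful when computed over the whole tweet collection, not over a single pair.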
Any ideas on how to adapt an existing model to Twitter, or on how to create a new one?
The post "Similarity Metrics on Twitter" discusses the different similarity measures that you can use for clustering Twitter data. We did some research on clustering Twitter users based on user connections, user mentions, geo-location, content similarity between tweets, content similarity between user descriptions, and common #hashtags.
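One simple way to combine several of these signals is a weighted sum of per-feature similarities. The sketch below is only an illustration under stated assumptions: the `Tweet` class, the `WEIGHTS` values, and the choice of Jaccard for every feature are hypothetical, and signals that need extra data (user connections, geo-location, user descriptions) are left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical minimal record; real tweets carry far more metadata.
    user: str
    text: str


# Hypothetical weights; in practice they would be tuned on labelled data.
WEIGHTS = {"hashtags": 0.5, "mentions": 0.2, "words": 0.3}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def features(tweet):
    text = tweet.text.lower()
    return {
        "hashtags": set(re.findall(r"#\w+", text)),
        "mentions": set(re.findall(r"@\w+", text)),
        # Plain words only: skip tokens that start with '#' or '@'.
        "words": set(re.findall(r"(?<![#@\w])\w+", text)),
    }


def combined_similarity(t1, t2, weights=WEIGHTS):
    # Weighted sum of per-feature Jaccard similarities, in [0, 1]
    # when the weights sum to 1. A feature that is empty in both
    # tweets contributes 0 under this definition.
    f1, f2 = features(t1), features(t2)
    return sum(w * jaccard(f1[name], f2[name]) for name, w in weights.items())
```

Giving hashtags the largest weight reflects the intuition in the question that they matter most for topics; that is a design choice to test, not a finding.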
For finding common topics on Twitter, looking at the connections between the users discussing a topic really helps: we found that groups of users tend to discuss a common topic. There is some detail about this in the second half of this post.