I am looking to cluster many feeds based on their tags. A typical example would be twitter feeds. Each feed will have user defined tags associated with it. By analyzing the tags , is it possible to cluster the feeds into different groups and tell so much feeds are based on so much tags. An example would be -
After clustering
Here clustering is found purely on basis of tags. Is there any good algorithm to achieve this
If I understand your question correctly, you would like to cluster the tags together and then put the feeds into these clusters based on the tags in the feed.
For this, you could create a similarity measure between the tags based on the number of feeds that the tags appear in together. For your example, this would be something like this
#earthquake | #asia | #bad | ...
#earthquake 1 | 1/2 | 2/2
#asia 1/2 | 1 | 1/2
#bad 2/3 | 1/3 | 1
...
Here, value at (i,j)
equals frequency of (i,j)/frequency of (i)
.
Now you have a similarity matrix between the tags and you could virtually any clustering algorithm that suits your needs. Since, the number of tags can be very large and estimating the number of clusters is difficult before running the algorithm, I would suggest using some heirarchical clustering algorithm like Fast Modularity clustering which is also very fast (See some details here). However, if you have some estimate of the number of clusters that you would like to break this into, then Spectral clustering might be useful too (See some details here).
After you cluster the tags together, you could use a simple approach to assign each feed to a cluster. This can be very simple, for example, counting the number of tags from each cluster in a feed and assigning a cluster with the maximum number of matching tags.
If you are flexible on your clustering strategy, then you could also try clustering the feeds together in a similar way by creating a similarity between the feeds based on the number of common tags between the feeds and then applying a clustering algorithm on the similarity matrix.
Interesting question. I'm making things up here, but I think this would work.
For each feed, come up with a complete list of tag combinations (of length >= 2), probably sorted for consistency. For example:
Then reverse the mapping:
You can then cull all the entries with a frequency higher than some threshold. In this case, if we take a frequency threshold of 2, then you'd get (bad-earthquake) with Feed1 and Feed2, and (layoff-XYZ) with Feed4, Feed5 and Feed6.
A naive implementation of this would have extremely poor performance -- exponential in the number of tags per feed (not to mention space requirements). However, there are various ways to apply heuristics to improve this. For example:
Hope this helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With