Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tag based clustering algorithm

I am looking to cluster many feeds based on their tags. A typical example would be twitter feeds. Each feed will have user defined tags associated with it. By analyzing the tags , is it possible to cluster the feeds into different groups and tell so much feeds are based on so much tags. An example would be -

  • Feed1 - Earthquake in indonasia #earthquake #asia #bad
  • Feed2 - There is a large earthquake in my area #earthquake #bad
  • Feed3 - My parents went to singapore #asia #tour
  • Feed4 - XYZ company is laying off many people #XYZ #layoff #bear
  • Feed5 - XYZ is getting bad is planning to layoff #XYZ #layoff #bad
  • Feed6 - XYZ is in a layoff spree #layoff #XYZ #worst

After clustering

  • #asia , # earthquake - Feed1 , Feed2
  • #XYZ , # layoff - Feed4 , Feed 5 , Feed6

Here clustering is found purely on basis of tags. Is there any good algorithm to achieve this

like image 366
Vineeth Mohan Avatar asked Feb 14 '13 14:02

Vineeth Mohan


2 Answers

If I understand your question correctly, you would like to cluster the tags together and then put the feeds into these clusters based on the tags in the feed.

For this, you could create a similarity measure between the tags based on the number of feeds that the tags appear in together. For your example, this would be something like this

               #earthquake | #asia | #bad | ...
#earthquake        1       |  1/2  |  2/2
#asia             1/2      |   1   |  1/2
#bad              2/3      |  1/3  |   1
...

Here, value at (i,j) equals frequency of (i,j)/frequency of (i).

Now you have a similarity matrix between the tags and you could virtually any clustering algorithm that suits your needs. Since, the number of tags can be very large and estimating the number of clusters is difficult before running the algorithm, I would suggest using some heirarchical clustering algorithm like Fast Modularity clustering which is also very fast (See some details here). However, if you have some estimate of the number of clusters that you would like to break this into, then Spectral clustering might be useful too (See some details here).

After you cluster the tags together, you could use a simple approach to assign each feed to a cluster. This can be very simple, for example, counting the number of tags from each cluster in a feed and assigning a cluster with the maximum number of matching tags.

If you are flexible on your clustering strategy, then you could also try clustering the feeds together in a similar way by creating a similarity between the feeds based on the number of common tags between the feeds and then applying a clustering algorithm on the similarity matrix.

like image 55
Pulkit Goyal Avatar answered Oct 01 '22 22:10

Pulkit Goyal


Interesting question. I'm making things up here, but I think this would work.

Algorithm

For each feed, come up with a complete list of tag combinations (of length >= 2), probably sorted for consistency. For example:

  • Feed1: (asia-bad), (asia-earthquake), (bad-earthquake), (asia-bad-earthquake)
  • Feed2: (bad-earthquake)
  • Feed3: (asia-tour)
  • Feed4: (bear-layoff), (bear-XYZ), (layoff-XYZ), (bear-layoff-XYZ)
  • Feed5: (bad-layoff), (bad-XYZ), (layoff-XYZ), (bad-layoff-XYZ)
  • Feed6: (layoff-worst), (layoff-XYZ), (worst-XYZ), (layoff-worst-XYZ)

Then reverse the mapping:

  • (asia-bad): Feed1
  • (asia-earthquake): Feed1
  • (bad-earthquake): Feed1, Feed2
  • (asia-bad-earthquake): Feed1
  • (asia-tour): Feed3
  • (bear-layoff): Feed4
  • ...
  • (layoff-XYZ): Feed4, Feed5, Feed6
  • ...

You can then cull all the entries with a frequency higher than some threshold. In this case, if we take a frequency threshold of 2, then you'd get (bad-earthquake) with Feed1 and Feed2, and (layoff-XYZ) with Feed4, Feed5 and Feed6.

Performance Concerns

A naive implementation of this would have extremely poor performance -- exponential in the number of tags per feed (not to mention space requirements). However, there are various ways to apply heuristics to improve this. For example:

  1. Determine the most popular X tags by scanning all feeds (or a random selection of X feeds) -- this is linear in the number of tags per feed. Then only consider the Y most popular tags for each feed.
  2. Determine the frequency of all (or most) tags. Then, for each post, only consider the X most popular tags in that post. This prevents situations where you have, say, fifteen tags for some post, resulting in a huge list of combinations, most of which would never occur.
  3. For each post, only consider combinations of length <= X. For example, if a feed had fifteen tags, you could end up with a huge number of combinations, but most of them would have very few occurrences, especially the long ones. So only consider combinations of two or three tags.
  4. Only scan a random selection of X feeds.

Hope this helps!

like image 21
Eric Galluzzo Avatar answered Oct 01 '22 22:10

Eric Galluzzo