I have a document d1 consisting of lines of form user_id tag_id. There is another document d2 consisting of tag_id tag_name I need to generate clusters of users with similar tagging behaviour. I want to try this with k-means algorithm in python. I am completely new to this and cant figure out how to start on this. Can anyone give any pointers?
Do I need to first create different documents for each user using d1 with his tag vocabulary? And then apply k-means algorithm on these documents? There are like 1 million users in d1. I am not sure I am thinking in right direction, creating 1 million files ?
Since the data you have is binary and sparse (in particular, not all users have tagged all documents, right)? So I'm not at all convinced that k-means is the proper way to do this.
Anyway, if you want to give k-means a try, have a look at the variants such as k-medians (which won't allow "half-tagging") and convex/spherical k-means (which supposedly works better with distance functions such as cosine distance, which seems a lot more appropriate here).
As mentioned by @Jacob Eggers, you have to denormalize the data to form the matrix which is a sparse one indeed. Use SciPy package in python for k means. See
Scipy Kmeans
for examples and execution. Also check Kmeans in python (Stackoverflow) for more information in python kmeans clustering.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With