Clustering using k-means in python

Question

I have a document d1 consisting of lines of the form user_id tag_id. There is another document d2 consisting of lines of the form tag_id tag_name. I need to generate clusters of users with similar tagging behaviour. I want to try this with the k-means algorithm in Python. I am completely new to this and can't figure out how to start. Can anyone give me some pointers?

Do I first need to create a separate document for each user from d1, containing that user's tag vocabulary, and then apply k-means to these documents? There are about 1 million users in d1. I am not sure I am thinking in the right direction: creating 1 million files?

Has QUIT--Anony-Mousse · Accepted Answer

The data you have is binary and sparse (in particular, not all users have tagged all documents, right?), so I'm not at all convinced that k-means is the proper way to do this.

Anyway, if you want to give k-means a try, have a look at variants such as k-medians (which won't allow "half-tagging") and convex/spherical k-means (which supposedly works better with distance functions such as cosine distance, which seems a lot more appropriate here).
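To make the spherical variant concrete, here is a minimal sketch using only NumPy. It is not a reference implementation: the function name `spherical_kmeans`, its parameters, and the toy user/tag matrix are all made up for illustration. The idea is to scale every row to unit length, so that a dot product equals cosine similarity, and to re-normalize each center after every update.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Illustrative spherical k-means: cluster rows of X by cosine similarity."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Scale rows to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)
    # Start from k distinct rows chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each row to the center with the highest cosine similarity.
        labels = np.argmax(X @ centers.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # keep the old center if a cluster is empty
            total = members.sum(axis=0)
            length = np.linalg.norm(total)
            if length > 0:
                centers[j] = total / length  # project back onto the unit sphere
    # Final assignment against the updated centers.
    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers

# Hypothetical toy data: 4 users x 4 tags, 1 means the user used the tag.
toy = [[1, 1, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
labels, centers = spherical_kmeans(toy, k=2)
print(labels)
```

Like ordinary k-means, this depends on the random start and can end with an empty cluster, so in practice you would run it several times with different seeds.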

sravan_kumar · Answer

As mentioned by @Jacob Eggers, you have to denormalize the data to form the user-by-tag matrix, which is indeed sparse. Use the SciPy package in Python for k-means. See Scipy Kmeans for examples and execution. Also check Kmeans in python (Stackoverflow) for more information on k-means clustering in Python.
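As a rough sketch of that denormalization step, the snippet below builds a sparse user-by-tag matrix directly from d1-style lines, so no per-user files are needed. The sample lines and variable names are hypothetical. It then converts the matrix to a dense array only because `scipy.cluster.vq.kmeans2` expects dense input; that conversion is fine for a small demo but would not be practical for 1 million users, where a k-means implementation that accepts sparse input would be needed.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.cluster.vq import kmeans2

# Hypothetical sample of d1: each line is "user_id tag_id".
d1_lines = ["u1 t1", "u1 t2", "u2 t1", "u2 t2",
            "u3 t3", "u3 t4", "u4 t3", "u4 t4"]

user_index, tag_index = {}, {}  # map ids to row / column numbers
rows, cols = [], []
for line in d1_lines:
    user_id, tag_id = line.split()
    rows.append(user_index.setdefault(user_id, len(user_index)))
    cols.append(tag_index.setdefault(tag_id, len(tag_index)))

# One row per user, one column per tag; only the nonzero cells are stored.
matrix = csr_matrix((np.ones(len(rows)), (rows, cols)),
                    shape=(len(user_index), len(tag_index)))
matrix.data[:] = 1.0  # repeated (user, tag) pairs are summed; clamp to binary

# Small demo only: kmeans2 needs a dense array.
centroids, labels = kmeans2(matrix.toarray(), 2, minit="points")
print(matrix.shape, labels)
```

The cluster labels vary between runs because the initial centroids are picked at random.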

Tags:

python

tags

cluster-analysis

k-means

data-mining

Maxwell
