`dataset` is a pandas DataFrame, and I am clustering it with sklearn.cluster.KMeans:
    km = KMeans(n_clusters=n_Clusters)
    km.fit(dataset)
    prediction = km.predict(dataset)
This is how I decide which entity belongs to which cluster:
    for i in range(len(prediction)):
        cluster_fit_dict[dataset.index[i]] = prediction[i]
This is how dataset looks:
    A 1 2 3 4 5 6
    B 2 3 4 5 6 7
    C 1 4 2 7 8 1
    ...
where A, B, C are the index labels.
Is this the correct way of using k-means?
Generally, scikit-learn works on any numeric data stored as numpy arrays or scipy sparse matrices. Other types that are convertible to numeric arrays such as pandas DataFrame are also acceptable.
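As a minimal sketch of the point above, the DataFrame can be handed to KMeans directly, and the returned labels follow the row order, so they can be paired with the index. The toy values mirror the question's example; `n_clusters=2`, `n_init=10`, `random_state=0`, the column names, and the name `cluster_of` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data shaped like the question's example, with A, B, C as the index.
# The column names f1..f6 are made up for this sketch.
dataset = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6],
     [2, 3, 4, 5, 6, 7],
     [1, 4, 2, 7, 8, 1]],
    index=["A", "B", "C"],
    columns=["f1", "f2", "f3", "f4", "f5", "f6"],
)

# n_clusters=2 is an arbitrary choice for three rows;
# random_state is fixed only to make the run repeatable.
km = KMeans(n_clusters=2, n_init=10, random_state=0)

# The DataFrame is passed as-is; labels come back in row order.
labels = km.fit_predict(dataset)

# Pair each index label with its cluster id.
cluster_of = dict(zip(dataset.index, labels))
print(cluster_of)
```

The cluster ids themselves are arbitrary integers, so only which rows share an id is meaningful.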
The KMeans class from the sklearn.cluster module of the scikit-learn library is used for k-means clustering. You can see that the class is imported in the following script. The make_blobs() method from the sklearn.
'k-means++' : selects initial cluster centroids using sampling based on an empirical probability distribution of the points' contribution to the overall inertia. This technique speeds up convergence, and is theoretically proven to be O(log k)-optimal.
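To make the `init` option concrete, here is a small sketch that fits the same synthetic data once with `init="k-means++"` and once with `init="random"`. The data, the three group centers, and the variable names are invented for illustration; `n_init=1` is used so that each run reflects a single initialization rather than the best of several.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three synthetic 2-D groups (made-up data, 50 points each).
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])

# "k-means++" is the default init; it is spelled out here to make the choice visible.
km_pp = KMeans(n_clusters=3, init="k-means++", n_init=1, random_state=0).fit(X)

# "random" picks the initial centroids from the observations at random.
km_rand = KMeans(n_clusters=3, init="random", n_init=1, random_state=0).fit(X)

print("k-means++ inertia:", km_pp.inertia_, "iterations:", km_pp.n_iter_)
print("random    inertia:", km_rand.inertia_, "iterations:", km_rand.n_iter_)
```

On data this well separated both runs may land on the same solution; differences in inertia and iteration count tend to show up on harder data or with more clusters.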
Assuming all the values in the dataframe are numeric,
    import pandas
    import sklearn.cluster

    # Convert DataFrame to matrix
    mat = dataset.values

    # Using sklearn
    km = sklearn.cluster.KMeans(n_clusters=5)
    km.fit(mat)

    # Get cluster assignment labels
    labels = km.labels_

    # Format results as a DataFrame
    results = pandas.DataFrame([dataset.index, labels]).T
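One possible variation on the last step, offered as a sketch rather than a correction: storing the labels in a pandas Series keyed by the original index keeps them as integers and makes lookups by row label straightforward. The toy data, `n_clusters=2`, `random_state=0`, and the names `results` and `members` are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data in the shape shown in the question.
dataset = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6],
     [2, 3, 4, 5, 6, 7],
     [1, 4, 2, 7, 8, 1]],
    index=["A", "B", "C"],
)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(dataset.values)

# One cluster id per row, indexed by the original row labels.
results = pd.Series(km.labels_, index=dataset.index, name="cluster")

# Look up a single row's cluster by its label.
cluster_of_a = results.loc["A"]

# Collect the row labels that fall into each cluster.
members = {
    int(label): list(rows)
    for label, rows in results.groupby(results).groups.items()
}
print(results)
print(members)
```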
Alternatively, you could try KMeans++ for Pandas.