Why does DBSCAN clustering return a single cluster on the MovieLens data set?

Question

The Scenario:

I'm clustering the MovieLens dataset, which I have in 2 formats:

OLD FORMAT:

uid iid rat
941 1   5
941 7   4
941 15  4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4

NEW FORMAT:

uid 1               2               3               4
1   5               3               4               3
2   4               3.6185548023    3.646073985     3.9238342172
3   2.8978348799    2.6692556753    2.7693015618    2.8973463681
4   4.3320762062    4.3407749532    4.3111995162    4.3411425423
940 3.7996234581    3.4979386925    3.5707888503    2
941 5               NaN             NaN             NaN
942 4.5762594612    4.2752554573    4.2522440019    4.3761477591
943 3.8252406362    5               3.3748860659    3.8487417604

I need to cluster this data using KMeans, DBSCAN and HDBSCAN. With KMeans I'm able to set the number of clusters and get them.
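For reference, the OLD FORMAT is a long table (one row per rating) and the NEW FORMAT is a wide user-by-item matrix. Below is a minimal sketch of going from one to the other with pandas, assuming Python 3 and a handful of rows copied from the tables above. The names long_df and wide_df are illustrative, not the asker's actual code.

```python
import pandas as pd

# A few ratings in the OLD FORMAT (long form: one row per rating).
long_df = pd.DataFrame({
    "uid": [941, 941, 941, 1, 2],
    "iid": [1, 7, 15, 1, 1],
    "rat": [5, 4, 4, 5, 4],
})

# One row per user, one column per item; unrated pairs become NaN.
wide_df = long_df.pivot(index="uid", columns="iid", values="rat")
print(wide_df)
```

A plain pivot leaves unrated items as NaN. The fractional values in the NEW FORMAT presumably come from a separate prediction step that is not shown in the question.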

The Problem

The problem occurs only with DBSCAN and HDBSCAN: I'm unable to get a reasonable number of clusters (I do know the number of clusters cannot be set manually for these).

Techniques Tried:

Tried this with the Iris dataset, where I found that the Species column isn't included in the features. That column is a string and is the value to be predicted anyway, and everything works fine with that dataset (Snippet 1).
Tried with the MovieLens 100K dataset in the OLD FORMAT (with and without UID), on the analogy that UID == SPECIES, hence the attempt without it (Snippet 2).
Tried the same with the NEW FORMAT (with and without UID), yet the results were the same.

Snippet 1:

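import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

print "\n\n FOR IRIS DATA-SET:"

iris = load_iris()
dbscan = DBSCAN()  # default parameters

# Cluster on the four numeric features only (no species column)
d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)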

Snippet 1 (Output):

FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]: 
array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1,  1,  1,
        1,  1,  1, -1, -1,  1, -1, -1,  1,  1,  1,  1,  1,  1,  1, -1, -1,
        1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1, -1, -1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

Snippet 2:

import pandas as pd
from sklearn.cluster import DBSCAN

ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch == 1:
    data_set = pd.read_csv("MainMatrix_IBCF.csv")
elif ch == 2:
    data_set = pd.read_csv("MainMatrix_UBCF.csv")
else:
    # Stop here: there is no data to cluster without a valid choice
    raise SystemExit("Enter Proper choice!")

# Drop the first column, then drop rows that contain NaN
data_set = data_set.iloc[:, 1:].dropna()

print "Starting with DBSCAN for Clustering on\n", data_set.info()

db_cluster = DBSCAN()  # default parameters
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)

Snippet 2 (Output):

Extended Cluster Methods for:
1. Main Matrix IBCF 
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])

As seen, it returns only 1 cluster. I'd like to hear what I'm doing wrong.

T3J45 · Accepted Answer

As pointed out by @faraway and @Anony-Mousse, the solution has more to do with the mathematics of the dataset than with programming.

I could finally get the clusters. Here's how:

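import numpy as np
from sklearn.cluster import DBSCAN

# data_set is the DataFrame loaded in Snippet 2
db_cluster = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2)
arr = db_cluster.fit_predict(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)

# Count how many points fall under each label
uni, counts = np.unique(arr, return_counts=True)
d = dict(zip(uni, counts))
print d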
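To illustrate why the defaults failed, here is a minimal sketch on synthetic data, assuming Python 3. The data, the helper summarize and the value eps=15.0 are made up for this illustration and are not tuned for MovieLens. In scikit-learn the label -1 marks noise rather than a cluster, so set([-1]) means that no cluster was found at all. With many columns, typical distances between rows are far larger than the default eps=0.5, so no point has enough neighbours.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two synthetic groups of 100 points in 50 dimensions (illustrative data only).
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 50)),
    rng.normal(loc=10.0, scale=1.0, size=(100, 50)),
])

def summarize(labels):
    """Return (number of clusters, number of noise points); -1 marks noise."""
    labels = np.asarray(labels)
    n_noise = int(np.sum(labels == -1))
    n_clusters = len(set(labels.tolist())) - (1 if n_noise else 0)
    return n_clusters, n_noise

# Defaults: eps=0.5, min_samples=5. Points in one group sit roughly 10 apart.
default_labels = DBSCAN().fit_predict(X)
# An eps on the scale of the within-group distances (hypothetical value).
wide_labels = DBSCAN(eps=15.0, min_samples=5).fit_predict(X)

print("default eps:", summarize(default_labels))
print("eps=15.0   :", summarize(wide_labels))
```

The same helper can be applied to arr from the answer to report clusters and noise separately.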

The concepts of epsilon and outliers became much clearer from the SO question: How can I choose eps and minPts (two parameters for DBSCAN algorithm) for efficient results?
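One commonly used heuristic for picking eps is the k-distance curve: compute each point's distance to its k-th nearest neighbour, sort the values, and look for the knee. Below is a minimal sketch, assuming Python 3 and scikit-learn's NearestNeighbors. The function k_distances and the random matrix X are placeholders for illustration, not part of the original answer. Since scikit-learn's min_samples counts the point itself, k = min_samples - 1 is a natural choice.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbour."""
    # Ask for k + 1 neighbours because each point is returned as its own
    # nearest neighbour (distance 0) when querying the training data.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, -1])

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))  # placeholder for the real rating matrix

kd = k_distances(X, k=4)  # k = min_samples - 1 for the default min_samples=5
# Plotting kd and looking for the knee is the usual next step; printing a few
# percentiles already shows the scale on which eps has to be chosen.
print(np.percentile(kd, [10, 50, 90]))
```

If most of these distances are well above 0.5, the default eps will label nearly everything as noise.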

Tags: python, pandas, cluster-analysis, dbscan
