 

Clustering categorical data using Jaccard similarity

I am trying to build a clustering algorithm for categorical data.

I have read about different algorithms like k-modes, ROCK, and LIMBO; however, I would like to build one of my own and compare its accuracy and cost to the others.

I have a training set of (m) rows and (n=22) features.

Approach

My approach is simple:

  • Step 1: I calculate the Jaccard similarity between each pair of rows in my training data, forming an (m*m) similarity matrix.
  • Step 2: I then perform some operations to find the best centroids and find the clusters by using a simple k-means approach.

The similarity matrix created in Step 1 is then used while performing the k-means algorithm.
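
For illustration, here is a toy example of the Step 1 computation on two hypothetical rows with four categorical features (the rows and values are invented for the example):

row_i = ["red", "small", "round", "soft"]
row_j = ["red", "large", "round", "hard"]
matches = sum(a == b for a, b in zip(row_i, row_j))  # "red" and "round" match -> 2
similarity = matches / float(len(row_i))             # 2 / 4 = 0.5

With 22 features, the denominator is simply 22.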

Matrix creation:

import numpy as np

total_columns = 22
m = len(data_set)                  # number of rows in the training data
fnl_matrix = np.zeros((m, m))

for i in range(0, m):
    for j in range(0, m):
        if j >= i:
            # Calculating jaccard similarity between data rows i and j
            common_count = 0
            for column in data_set.columns:
                if data_set[column][j] == data_set[column][i]:
                    common_count += 1
            probability = common_count / float(total_columns)
            fnl_matrix[i][j] = probability
            fnl_matrix[j][i] = probability

A partial snapshot of my fnl_matrix (for 6 rows) is given below:

[screenshot of the 6x6 similarity matrix]

Problem Statement:

The problem I face is that when I create the (m*m) matrix, performance degrades badly for larger data sets. Even for a smaller dataset with 8000 rows, creating the similarity matrix takes an unbearable amount of time. Is there any way I could tune my code, or restructure the matrix computation, to make it more cost-efficient?

asked May 09 '15 by Sam


2 Answers

Interpreted Python code is slow. Really slow.

That is why the good Python toolkits contain plenty of Cython code and even C and Fortran code (e.g. the matrix operations in numpy), and only use Python for driving the overall process.

You may be able to speed up your code substantially if you use as much numpy as possible, or if you use Cython instead.
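
For instance, the per-column equality test can be pushed into numpy broadcasting, so Python only loops over the 22 columns instead of the m*m row pairs. A minimal sketch, assuming the categorical data has been integer-coded into a 2-D array X (a hypothetical name, not from the question):

import numpy as np

m, n = 8000, 22
X = np.random.randint(0, 5, size=(m, n))  # stand-in for the encoded data

# One vectorized (m, m) boolean comparison per column, all done in C.
matches = np.zeros((m, m))
for col in range(n):
    c = X[:, col]
    matches += c[:, None] == c[None, :]

fnl_matrix = matches / n  # fraction of matching attributes per row pair

Note that the (m, m) float matrix itself dominates memory (roughly half a gigabyte for m=8000), no matter how fast it is computed.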

Instead of fighting centroids, consider using a distance-based clustering algorithm (a short scikit-learn sketch follows the list below):

  • Hierarchical Agglomerative Clustering (HAC), which expects a distance matrix
  • DBSCAN, which can work with arbitrary distances. It does not even need a distance matrix, only a list of similar items for some threshold.
  • K-medoids/PAM is of course worth a try, too, but it is usually not very fast.
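
As a rough illustration, here is a minimal sketch of feeding a precomputed matrix to HAC and DBSCAN in scikit-learn; the random sim matrix, the similarity-to-distance conversion, and the eps/n_clusters values are assumptions for the demo, not part of this answer:

import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

rng = np.random.default_rng(0)
sim = rng.random((100, 100))   # stand-in for the question's fnl_matrix
sim = (sim + sim.T) / 2        # make it symmetric
np.fill_diagonal(sim, 1.0)     # self-similarity is 1
dist = 1.0 - sim               # both estimators expect distances

# older scikit-learn versions call the `metric` parameter `affinity`
hac = AgglomerativeClustering(n_clusters=5, metric="precomputed",
                              linkage="average")
hac_labels = hac.fit_predict(dist)

db = DBSCAN(eps=0.3, min_samples=5, metric="precomputed")
db_labels = db.fit_predict(dist)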

answered Sep 30 '22 by Has QUIT--Anony-Mousse


First of all, the way you calculate Jaccard similarity seems to be inefficient (if not erroneous). You are using nested for loops, which are probably the slowest way to do things in Python. I would recommend using Python's set to store the rows. Sets provide fast intersection, because they are hash tables and all the calculations are performed in C/C++, not in Python itself. Imagine r1 and r2 are two rows.

r1 = set(some_row1)
r2 = set(some_row2)
intersection_len = len(r1.intersection(r2))
union_len = len(r1) + len(r2) - intersection_len
jaccard = intersection_len / float(union_len)

Set construction is expensive, so you should store all your rows as sets from the start. Then you should get rid of the

for i in range(0, m):
    for j in range(0, m):

part as well. Use itertools instead. Let's assume data_set is a list of rows.

for row1, row2 in itertools.combinations(data_set, r=2):
    ...

This runs a lot faster and kills the need for the if j>=i check; this way you only compute the upper triangle of the matrix. Let's make a sketch of the final algorithm. Update: adding numpy.

from scipy.spatial import distance
from itertools import combinations
import numpy as np


def jaccard(set1, set2):
    intersection_len = len(set1.intersection(set2))
    union_len = len(set1) + len(set2) - intersection_len
    return intersection_len / float(union_len)

original_data_set = [row1, row2, row3,..., row_m]
data_set = [set(row) for row in original_data_set]

jaccard_generator = (jaccard(row1, row2) for row1, row2 in combinations(data_set, r=2))
flattened_matrix = np.fromiter(jaccard_generator, dtype=np.float64)

# since flattened_matrix is the flattened upper triangle of the matrix
# we need to expand it.
normal_matrix = distance.squareform(flattened_matrix)
# replace the zeros on the diagonal with ones (each row's self-similarity is 1)
normal_matrix += np.identity(len(data_set))

That's it. You've got your matrix. From this point you might consider taking this block of code and porting it to Cython (there isn't much work left to do, you only need to define the jaccard function in a slightly different way, i.e. add type declarations for local variables). Something like:

cpdef double jaccard(set set1, set set2):
    cdef long intersection_len, union_len  # or consider int
    intersection_len = len(set1.intersection(set2))
    union_len = len(set1) + len(set2) - intersection_len
    # cast to double to avoid C integer division
    return intersection_len / <double> union_len

but I'm not sure this will compile correctly (my Cython experience is very limited).

P.S. You can use numpy arrays instead of sets, because they provide a similar intersection method (np.intersect1d) and run in C/C++ as well, but intersecting two arrays takes roughly O(n log n) time (intersect1d sorts its inputs), while intersecting two hash tables (set objects) takes O(n) expected time, provided the collision rate is close to zero.
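
For reference, the two primitives side by side (toy arrays, purely illustrative):

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([3, 4, 5, 6])
arr_inter = np.intersect1d(a, b)               # sort-based
set_inter = set(a.tolist()) & set(b.tolist())  # hash-based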

answered Sep 30 '22 by Eli Korvigo