I am trying to build a clustering algorithm for categorical data.
I have read about different algorithms like k-modes, ROCK, and LIMBO; however, I would like to build one of my own and compare its accuracy and cost to the others.
I have m training examples and n=22 features.
My approach is simple:
The similarity matrix I create in step 1 would be used while performing the k-means algorithm
total_columns = 22
for i in range(0, len(data_set)):
    for j in range(0, len(data_set)):
        if j >= i:
            # Calculating jaccard similarity between two data rows i and j
            # as the fraction of columns on which they hold the same value
            common_count = 0
            for column in data_set.columns:
                if data_set[column][j] == data_set[column][i]:
                    common_count = common_count + 1
            probability = common_count / float(total_columns)
            fnl_matrix[i][j] = probability
            fnl_matrix[j][i] = probability
[Partial snapshot of my fnl_matrix (for 6 rows) omitted.]
The problem I face is that when I create the (m*m) matrix for a larger data set, my performance goes for a toss. Even for a smaller dataset with 8000 rows, building the similarity matrix takes an unbearable amount of time. Is there any way I could tune my code, or do something to the matrix, that would be cost-efficient?
Interpreted Python code is slow. Really slow.
That is why good Python toolkits contain plenty of Cython code and even C and Fortran code (e.g., matrix operations in numpy), and only use Python for driving the overall process.
You may be able to speed up your code substantially if you use as much numpy as possible, or if you use Cython instead.
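As an illustration of the numpy route, here is a hedged sketch for the measure used in the question (fraction of matching columns). The integer-coded random data is only a stand-in for the real rows, and note that the full m*m float matrix for 8000 rows already takes roughly 0.5 GB of memory:

import numpy as np

# hypothetical stand-in for the real data: m rows of 22 integer-coded categories
rng = np.random.default_rng(0)
data = rng.integers(0, 5, size=(8000, 22))

m, n = data.shape
fnl_matrix = np.zeros((m, m))
for col in range(n):
    c = data[:, col]
    # one (m, m) boolean comparison per column, evaluated in C by numpy
    fnl_matrix += c[:, None] == c[None, :]
fnl_matrix /= n

This replaces tens of millions of interpreted pair evaluations with just 22 vectorized comparisons.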
Instead of fighting centroids, consider using a distance-based clustering algorithm that can consume such a precomputed matrix directly, as sketched below.
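One hedged sketch: DBSCAN from scikit-learn accepts a precomputed distance matrix, so the similarity matrix built above can feed it directly (the eps and min_samples values here are placeholders, not tuned):

from sklearn.cluster import DBSCAN

# turn similarities in [0, 1] into distances; eps and min_samples are placeholders
distance_matrix = 1.0 - fnl_matrix
labels = DBSCAN(eps=0.3, min_samples=5, metric='precomputed').fit_predict(distance_matrix)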
First of all, the way you calculate Jaccard similarity seems to be inefficient (if not erroneous). You are using nested for loops, which are probably the slowest way to do stuff in Python: for m = 8000 rows that is m*(m+1)/2, roughly 32 million pairs, each doing 22 interpreted comparisons. I would recommend you utilize Python's set to store the rows. Sets provide fast intersection, because they are hash tables and all the calculations are performed in C, not in Python itself. Imagine r1 and r2 are two rows.
r1 = set(some_row1)
r2 = set(some_row2)
intersection_len = len(r1.intersection(r2))
union_len = len(r1) + len(r2) - intersection_len
jaccard = intersection_len / union_len
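One caveat worth noting: set(some_row) discards column positions, so equal values coming from different columns will count as a match. If the per-column comparison from the question is what you want, a small (hypothetical) adjustment is to tag each value with its column index before building the set:

# pair each value with its column index, so matches are per-column
r1 = set(enumerate(some_row1))
r2 = set(enumerate(some_row2))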
Set construction is expensive, thus you should initially store all your rows as sets. Then you should get rid of the

for i in range(0, len(data_set)):
    for j in range(0, len(data_set)):

part as well. Use itertools instead. Let's assume data_set is a list of rows.
for row1, row2 in itertools.combinations(data_set, r=2):
    ...
This runs a lot faster and kills the need for the if j >= i check: this way you get the upper triangle of the matrix directly. Let's make a sketch of the final algorithm (update: adding numpy).
from scipy.spatial import distance
from itertools import combinations
import numpy as np

def jaccard(set1, set2):
    intersection_len = len(set1.intersection(set2))
    union_len = len(set1) + len(set2) - intersection_len
    return intersection_len / union_len

original_data_set = [row1, row2, row3,..., row_m]
data_set = [set(row) for row in original_data_set]

jaccard_generator = (jaccard(row1, row2)
                     for row1, row2 in combinations(data_set, r=2))
flattened_matrix = np.fromiter(jaccard_generator, dtype=np.float64)

# flattened_matrix is the flattened upper triangle of the matrix:
# combinations yields m * (m - 1) / 2 values, which is exactly the
# condensed form that scipy.spatial.distance expects, so we expand it
normal_matrix = distance.squareform(flattened_matrix)
# replacing zeros with ones at the diagonal
normal_matrix += np.identity(len(data_set))
That's it. You've got your matrix. From this point you might consider porting this block of code to Cython (there isn't much work left to do; you only need to define the jaccard function in a slightly different way, i.e. add type declarations for local variables). Something like:
cpdef double jaccard(set set1, set set2):
    cdef long intersection_len, union_len  # or consider int
    intersection_len = len(set1.intersection(set2))
    union_len = len(set1) + len(set2) - intersection_len
    return intersection_len / <double>union_len  # cast to avoid C integer division
but I'm not sure this will compile correctly (my Cython experience is very limited).
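If you want to try it, a minimal build setup might look like the following sketch (the file name jaccard_fast.pyx is hypothetical):

# setup.py -- minimal Cython build script for the hypothetical jaccard_fast.pyx
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("jaccard_fast.pyx"))

Build it in place with python setup.py build_ext --inplace, then import jaccard from the compiled module.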
P.S.
You can use numpy arrays instead of sets, because they provide a similar intersection routine and run in C as well. However, numpy's intersection (np.intersect1d) is sorting-based and takes roughly O(n log n) time, while intersection of two hash tables (set objects) takes O(n) time, provided the collision rate is close to zero.
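For completeness, the array-based variant would look something like this sketch (assuming each row contains unique values, since np.intersect1d deduplicates):

import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([3, 4, 5])

# np.intersect1d sorts under the hood, hence the O(n log n) cost
intersection_len = len(np.intersect1d(a, b))
union_len = len(a) + len(b) - intersection_len
jaccard = intersection_len / union_len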