I am trying to apply hierarchical clustering to my dataset, which consists of 14039 user vectors. Each vector has 10 features, where each feature is basically the frequency of tags used by that user. I am using the SciPy API for clustering. Now I need to calculate the pairwise distances between these 14039 users and pass this distance matrix to the linkage function.
import numpy as np
import scipy.cluster.hierarchy as sch

Y = sch.distance.pdist(allUserVector, 'cosine')
np.set_printoptions(threshold=np.nan)   # print the full array instead of a summary
print Y
But my program gives me a MemoryError while calculating the distance matrix itself:
File "/usr/lib/pymodules/python2.7/numpy/core/numeric.py", line 1424, in array_str
return array2string(a, max_line_width, precision, suppress_small, ' ', "", str)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 306, in array2string
separator, prefix)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 210, in _array2string
format_function = FloatFormat(data, precision, suppress_small)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 392, in __init__
self.fillFormat(data)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 399, in fillFormat
non_zero = absolute(data.compress(not_equal(data, 0) & ~special))
MemoryError
Any idea how to fix this? Is my dataset too large? I would have guessed that clustering 14k users shouldn't be demanding enough to cause a MemoryError. I am running it on an i3 with 4 GB of RAM. I need to apply DBSCAN clustering too, but that also needs a distance matrix as input.
Any suggestions appreciated.
Edit: I get the error only when I print Y. Any ideas why?
Well, hierarchical clustering doesn't make that much sense for large datasets. In my opinion it is mostly a textbook example. The problem with hierarchical clustering is that it doesn't really build sensible clusters. It builds a dendrogram, but with 14000 objects the dendrogram becomes pretty much unusable, and very few implementations of hierarchical clustering have non-trivial methods to extract sensible clusters from it. Plus, in the general case, hierarchical clustering has O(n^3) complexity, which makes it scale really badly to large datasets.
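For illustration, a minimal sketch of what "extracting clusters from the dendrogram" looks like with SciPy's own fcluster; the 'average' linkage and the 0.5 cut threshold are arbitrary example choices, and the random data is just a small stand-in for the real 14039 x 10 matrix so the sketch runs quickly:

import numpy as np
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance as ssd

# Small random stand-in for the real 14039 x 10 user/tag frequency matrix.
X = np.random.rand(1000, 10)

Y = ssd.pdist(X, 'cosine')              # condensed distance matrix, O(n^2) entries
Z = sch.linkage(Y, method='average')    # builds the dendrogram (the expensive part)

# Cut the dendrogram at an (arbitrary) cosine-distance threshold to get flat labels.
labels = sch.fcluster(Z, t=0.5, criterion='distance')
print(labels.max())                     # number of flat clusters found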
DBSCAN technically does not need a distance matrix. In fact, when you use a distance matrix, it will be slow, as computing the distance matrix already takes O(n^2) time. And even then, you can save the O(n^2) memory cost for DBSCAN by computing the distances on the fly, at the cost of computing each distance twice. DBSCAN visits each point once, so there is next to no benefit from using a distance matrix except the symmetry gain. And technically, you could do some neat caching tricks to reduce even that, since DBSCAN also just needs to know which objects are below the epsilon threshold. When epsilon is chosen reasonably, managing the neighbor sets on the fly will use significantly less memory than O(n^2), at the same CPU cost as computing the distance matrix.
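To illustrate the "distances on the fly" idea, here is a rough sketch of a per-point region query (the name region_query and the eps value are made up for illustration). Each query needs only O(n) memory instead of the O(n^2) matrix, and the resulting neighbor sets are all DBSCAN really needs:

import numpy as np

def region_query(X_unit, i, eps):
    # X_unit has unit-length rows, so the cosine distance to row i is 1 - dot product.
    dists = 1.0 - X_unit.dot(X_unit[i])
    return np.where(dists <= eps)[0]    # indices of the eps-neighbors of point i

X = np.random.rand(14039, 10)           # stand-in for the user/tag frequency vectors
X_unit = X / np.sqrt((X ** 2).sum(axis=1))[:, None]
neighbors_of_0 = region_query(X_unit, 0, eps=0.05)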
Any really good implementation of DBSCAN (it is spelled all uppercase, by the way, as it is an abbreviation, not a scan) should, however, have support for index structures, and then run in O(n log n) time.
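As a sketch of what an index buys you, here is an eps-range query against SciPy's cKDTree instead of a linear scan. Note that cKDTree works with Euclidean distance; since the vectors can be normalized to unit length, a cosine distance of eps corresponds to a Euclidean radius of sqrt(2 * eps). The eps value is again just an example:

import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(14039, 10)
X_unit = X / np.sqrt((X ** 2).sum(axis=1))[:, None]

tree = cKDTree(X_unit)                  # build the index once
eps = 0.05                              # example cosine-distance threshold
# For unit vectors, ||a - b||^2 = 2 * (1 - cos(a, b)), hence the sqrt(2 * eps) radius.
neighbors_of_0 = tree.query_ball_point(X_unit[0], np.sqrt(2 * eps))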
On http://elki.dbs.ifi.lmu.de/wiki/Benchmarking they run DBSCAN on a dataset of 110250 objects with 8 dimensions; the non-indexed variant takes 1446 seconds, the indexed one just 219. That is about 7 times faster, including index construction. (It's not Python, however.) Similarly, OPTICS is 5 times faster with the index. And in my experiments their k-means implementation was around 6x faster than WEKA's k-means while using much less memory. Their single-link hierarchical clustering is also an optimized O(n^2) implementation; actually the only one I've seen so far that is not the naive O(n^3) matrix-editing approach.
If you are willing to go beyond Python, that might be a good choice.