Memory Error when calculating pairwise distances in scipy

Question

I am trying to apply hierarchial clustering to my dataset which consists of 14039 vectors of users. Each vector has 10 features, where each feature is basically frequency of tags tagged by that user. I am using Scipy api for clustering. Now I need to calculate pairwise distances between these 14039 users and pass tis distance matrix to linkage function.

  import scipy.cluster.hierarchy as sch
  Y = sch.distance.pdist( allUserVector,'cosine')
  set_printoptions(threshold='nan')
  print Y

But my program gives me MemoryError while calculating the distance matrix itself

  File "/usr/lib/pymodules/python2.7/numpy/core/numeric.py", line 1424, in array_str
    return array2string(a, max_line_width, precision, suppress_small, ' ', "", str)
  File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 306, in array2string
    separator, prefix)
  File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 210, in _array2string
    format_function = FloatFormat(data, precision, suppress_small)
  File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 392, in __init__
    self.fillFormat(data)
  File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 399, in fillFormat
    non_zero = absolute(data.compress(not_equal(data, 0) & ~special))
MemoryError

Any idea how to fix this? Is my dataset too large? But I guess clustering 14k users shouldnt be too much that it should cause Memory error. I am running it on i3 and 4 Gb Ram. I need to apply DBScan clustering too, but that too needs distance matrix as input.

Any suggestions appreciated.

Edit: I get the error only when I print Y. Any ideas why?

Has QUIT--Anony-Mousse · Accepted Answer

Well, hierarchical clustering doesn't make that much sense for large datasets. It's actually mostly a textbook example in my opinion. The problem with hierarchical clustering is that it doesn't really build sensible clusters. It builds a dendrogram, but with 14000 objects the dendrogram becomes pretty much unusable. And very few implementations of hierarchical clustering have non-trivial methods to extract sensible clusters from the dendrogram. Plus, in the general case, hierarchical clustering is of complexity O(n^3) which makes it scale really bad to large datasets.

DBSCAN technically does not need a distance matrix. In fact, when you use a distance matrix, it will be slow, as computing the distance matrix already is O(n^2). And even then, you can safe the O(n^2) memory cost for DBSCAN by computing the distances on the fly at the cost of computing distances twice each. DBSCAN visits each point once, so there is next to no benefit from using a distance matrix except the symmetry gain. And technically, you could do some neat caching tricks to even reduce that, since DBSCAN also just needs to know which objects are below the epsilon threshold. When the epsilon is chosen reasonably, managing the neighbor sets on the fly will use significantly less memory than O(n^2) at the same CPU cost of computing the distance matrix.

Any really good implementation of DBSCAN (it is spelled all uppercase, btw, as it is an abbreviation, not a scan) however should have support for index structures and then run in O(n log n) runtime.

On http://elki.dbs.ifi.lmu.de/wiki/Benchmarking they run DBSCAN on a 110250 object dataset and 8 dimensions, and the non-indexed variant takes 1446 seconds, the one with index just 219. That is about 7 times faster, including index buildup. (It's not python, however) Similarly, OPTICS is 5 times faster with the index. And their kmeans implementation in my experiments was around 6x faster than WEKA kmeans and using much less memory. Their single-link hierarchical clustering also is an optimized O(n^2) implementation. Actually the only one I've seen so far that is not the naive O(n^3) matrix-editing approach. If you are willing to go beyond python, that might be a good choice.

Memory Error when calculating pairwise distances in scipy

Tags:

python

numpy

scipy

cluster-analysis

Maxwell

Video Answer

1 Answers

Has QUIT--Anony-Mousse

Recent Activity

Donate For Us

Memory Error when calculating pairwise distances in scipy

Tags:

python

numpy

scipy

cluster-analysis

Maxwell

Video Answer

1 Answers

Has QUIT--Anony-Mousse

Related questions

Recent Activity

Donate For Us