Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MemoryError in Python while using cKDTree().query_ball_tree

I have large 2D arrays with unsorted (X,Y) points, for which I need to know which points are in close proximity to each other (nearest-neighbor lookup). I have used cKDTree and query_ball_tree with succes for arrays with around 500,000 (X,Y) points. However, when I try the same algorithm for datasets of more than 1,000,000 points, query_ball_tree results in a MemoryError.

I use 64-bit Windows with 16Gb of internal memory, and have tried both 32-bit and 64-bit versions of Python and the extension modules (scipy and numpy).

def Construct_SearchTree(AllXyPoints):
    KDsearch = cKDTree(AllXyPoints)  
    return KDsearch.query_ball_tree(KDsearch, Maxdist)

My questions:

1) does anybody know of an alternative to cKDTree / query_ball_tree that consumes less memory? Speed is less important in this case that memory usage.

2) I hoped that switching from 32-bit to 64-bit python & extensions would solve the MemoryError. What could be the reason that it didn't?

Thanks for your incoming help and advice.

like image 603
Eelco Verschelling Avatar asked Oct 04 '22 08:10

Eelco Verschelling


1 Answers

I experienced a MemoryError with SciPy's cKDTree during construction and scikit-learn's KDTree when calling .query_radius(). I found that Scikit-learn's BallTree was more memory efficient, and using a BallTree solved the issue for me. I've tested BallTree with 1 million data points on my 64-bit system. It still consumes all my available memory (12GB) and some swap space, but I don't get a MemoryError.

Queries on a BallTree will not be as fast as a KDTree since your data is 2D, and BallTrees are slower than KDTrees when d <= 3 (see explanation here). However, given that cKDtree and scikit-learn's KDTree both raise MemorErrors (on my system, anyway), the simplest solution is to use a BallTree.

from sklearn.neighbors import BallTree
import numpy as np

max_dist = .1
points = np.random.normal(size=2000000).reshape(1000000, 2) #1 million points
ball_tree = BallTree(points)

neighbors = ball_tree.query_radius(points, max_dist)

Depending on your Maxdist, the returned result can consume a lot of memory (up to O(n^2)), but scikit-learn's BallTree.query_radius() returns an np.array of np.arrays rather than a list of lists so it should save you some memory there (see this answer for an explanation).

like image 189
JaminSore Avatar answered Oct 12 '22 10:10

JaminSore