I have large 2D arrays with unsorted (X,Y) points, for which I need to know which points are in close proximity to each other (nearest-neighbor lookup). I have used cKDTree and query_ball_tree with succes for arrays with around 500,000 (X,Y) points. However, when I try the same algorithm for datasets of more than 1,000,000 points, query_ball_tree results in a MemoryError.
I use 64-bit Windows with 16Gb of internal memory, and have tried both 32-bit and 64-bit versions of Python and the extension modules (scipy and numpy).
def Construct_SearchTree(AllXyPoints):
KDsearch = cKDTree(AllXyPoints)
return KDsearch.query_ball_tree(KDsearch, Maxdist)
My questions:
1) does anybody know of an alternative to cKDTree / query_ball_tree that consumes less memory? Speed is less important in this case that memory usage.
2) I hoped that switching from 32-bit to 64-bit python & extensions would solve the MemoryError. What could be the reason that it didn't?
Thanks for your incoming help and advice.
I experienced a MemoryError
with SciPy's cKDTree
during construction and scikit-learn's KDTree
when calling .query_radius()
. I found that Scikit-learn's BallTree
was more memory efficient, and using a BallTree
solved the issue for me. I've tested BallTree
with 1 million data points on my 64-bit system. It still consumes all my available memory (12GB) and some swap space, but I don't get a MemoryError
.
Queries on a BallTree
will not be as fast as a KDTree
since your data is 2D, and BallTree
s are slower than KDTree
s when d <= 3 (see explanation here). However, given that cKDtree
and scikit-learn's KDTree
both raise MemorError
s (on my system, anyway), the simplest solution is to use a BallTree
.
from sklearn.neighbors import BallTree
import numpy as np
max_dist = .1
points = np.random.normal(size=2000000).reshape(1000000, 2) #1 million points
ball_tree = BallTree(points)
neighbors = ball_tree.query_radius(points, max_dist)
Depending on your Maxdist
, the returned result can consume a lot of memory (up to O(n^2)), but scikit-learn's BallTree.query_radius()
returns an np.array
of np.array
s rather than a list
of list
s so it should save you some memory there (see this answer for an explanation).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With