Speed of K-Nearest-Neighbour build/search with SciKit-learn and SciPy

I have a large set of 2-dimensional points and want to be able to rapidly query the set for the k-Nearest-Neighbours of any point in the 2-d space. Since it's low-dimensional, a KD-Tree seems like a good way to go about it. My initial data set will only be updated very rarely, so the time to query a point should be more important to me than the build-time. However, each time I run the program I will need to re-load the object, so I also need a structure that can be saved and reloaded swiftly.

The two choices readily available are the KDTree structures in SciPy and in SciKit-learn. Below I profile the two of these for build-speed and query-speed across a large range of list lengths. I also pickle the SciKit-learn structure and show the time to re-load the object from the pickle. These are compared in a graph, and the code used to generate the timings is included below.

As I show in the graph, loading from a pickle is faster than building from scratch by half an order of magnitude for large N, showing that the KDTree is suitable for my use case (i.e. frequent re-loads but infrequent re-builds).

Comparing build-, reload- and query-time of two KD-Tree structures

Code to compare build-times:

# Profiling the building time for the two KD-tree structures and re-loading from a pickle
import math, timeit, pickle, sklearn.neighbors

the_lengths = [100, 1000, 10000, 100000, 1000000]

theSciPyBuildTime = []
theSklBuildTime = []
theRebuildTime = []

for length in the_lengths:
    dim = 5*int(math.sqrt(length))
    nTimes = 50
    from random import randint
    listOfRandom2DPoints = [ [randint(0,dim),randint(0,dim)] for x in range(length)]

    setup = """import scipy.spatial
import sklearn.neighbors
length = """ + str(length) + """
dim = """ + str(dim) + """
from random import randint
listOfRandom2DPoints = [ [randint(0,dim),randint(0,dim)] for x in range(length)]"""

    theSciPyBuildTime.append( timeit.timeit('scipy.spatial.KDTree(listOfRandom2DPoints, leafsize=20)', setup=setup, number=nTimes)/nTimes )
    theSklBuildTime.append( timeit.timeit('sklearn.neighbors.KDTree(listOfRandom2DPoints, leaf_size=20)', setup=setup, number=nTimes)/nTimes )

    theTreeSkl = sklearn.neighbors.KDTree(listOfRandom2DPoints, leaf_size=20, metric='euclidean')
    temp = pickle.dumps(theTreeSkl)  # serialize in memory; reload speed is timed below

    theRebuildTime.append( timeit.timeit('pickle.loads(temp)', 'from __main__ import pickle,temp', number=nTimes)/nTimes )
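
Since my real use case is "build rarely, reload often", here is a minimal sketch of how the tree can be persisted to disk between runs. It assumes the listOfRandom2DPoints array from the block above; the filename temp.pkl is just a placeholder.

# Build once, save to disk, and reload on later runs
import pickle, sklearn.neighbors

theTreeSkl = sklearn.neighbors.KDTree(listOfRandom2DPoints, leaf_size=20, metric='euclidean')
with open('temp.pkl', 'wb') as f:    # binary mode is required for pickle
    pickle.dump(theTreeSkl, f)

with open('temp.pkl', 'rb') as f:    # on a later run: reload instead of rebuilding
    theTreeSkl = pickle.load(f)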

Code to compare query-times:

# Profiling the query time for the two KD-tree structures
import math, timeit
import scipy.spatial, sklearn.neighbors
from random import randint

the_lengths = [100, 1000, 10000, 100000, 1000000, 10000000]

theSciPyQueryTime = []
theSklQueryTime = []

for length in the_lengths:
    dim = 5*int(math.sqrt(length))
    nTimes = 50
    listOfRandom2DPoints = [ [randint(0,dim),randint(0,dim)] for x in range(length)]

    setup = """from __main__ import sciPiTree,sklTree
from random import randint
length = """ + str(length) + """
randPoint = [randint(0,""" + str(dim) + """),randint(0,""" + str(dim) + """)]""" 

    sciPiTree = scipy.spatial.KDTree(listOfRandom2DPoints, leafsize=20)
    sklTree = sklearn.neighbors.KDTree(listOfRandom2DPoints, leaf_size=20)

    theSciPyQueryTime.append( timeit.timeit('sciPiTree.query(randPoint,10)', setup=setup, number=nTimes)/nTimes )
    theSklQueryTime.append( timeit.timeit('sklTree.query(randPoint,10)', setup=setup, number=nTimes)/nTimes )

 

Questions:

  1. The Result: Although they're getting closer for very large N, SciKit-learn seems to beat SciPy for both build time and query time. Have other people found this?

  2. The Maths: Are there better structures available for this? I'm only working in a 2D space, but the data will be quite dense, so brute force is out. Is there a better structure for low-dimensional kNN searches?

  3. The Speed: It looks like the build-times for the two approaches get closer at large N, but my computer gave up on me. Can anyone verify this for larger N? And does re-build time continue to increase roughly linearly as well? Thanks!

  4. Practicalities: The SciPy KDTree won't pickle. As reported in this post, I'm given the following error: "PicklingError: Can't pickle <class 'scipy.spatial.kdtree.innernode'>: it's not found as scipy.spatial.kdtree.innernode". I think this is because it is a nested class. According to an answer in this post, nested structures can be pickled with dill. However, dill gives me the same error. Why is this?

asked May 25 '15 by StackG


1 Answer

Before I get to the answers, I would like to point out that when your program works with a large set of numbers, you should store them in a numpy.array from the NumPy library. I don't know which versions of Python, scikit-learn, and SciPy you are using, but I am using Python 3.7.3, scikit-learn 0.21.3, and SciPy 1.3.0. When I ran your code to compare build times, I got AttributeError: 'list' object has no attribute 'size'. This error says that the list listOfRandom2DPoints has no size attribute: sklearn.neighbors.KDTree expects a numpy.array, which has one. scipy.spatial.KDTree, on the other hand, works with Python lists because, as you can see in the source code of its __init__ method, the first line is self.data = np.asarray(data), which converts the data to a numpy.array.

Because of this, I changed your lines:

from random import randint
listOfRandom2DPoints = [ [randint(0,dim),randint(0,dim)] for x in range(length)]

to:

import numpy as np
listOfRandom2DPoints = np.random.randint(0, dim, size=(length, 2))

(This change doesn't affect the speed comparisons, because it is made in the setup code.)

Now, to your questions:

  1. As you said, scikit-learn seems to beat SciPy on build time. The reason isn't that scikit-learn has a faster algorithm; rather, sklearn.neighbors.KDTree is implemented in Cython (link to source code), while scipy.spatial.KDTree is written in pure Python (link to source code).

    (If you don't know what Cython is, an oversimplified explanation is that it makes it possible to write C-like code in Python; the main reason for doing so is that C is much faster than Python.)

    SciPy also has a Cython implementation, scipy.spatial.cKDTree (link to source code). It works the same way as scipy.spatial.KDTree, and if you compare the build times of sklearn.neighbors.KDTree and scipy.spatial.cKDTree:

    timeit.timeit('scipy.spatial.cKDTree(listOfRandom2DPoints, leafsize=20)', setup=setup, number=nTimes)
    timeit.timeit('sklearn.neighbors.KDTree(listOfRandom2DPoints, leaf_size=20)', setup=setup, number=nTimes)
    

    Build times are very similar, and when I ran the code, scipy.spatial.cKDTree was a little (around 20%) faster. A self-contained version of this comparison is sketched after this list.

    The situation with query times is very similar: scipy.spatial.KDTree (the pure Python implementation) is about ten times slower than sklearn.neighbors.KDTree (the Cython implementation), while scipy.spatial.cKDTree (also Cython) is approximately as fast as sklearn.neighbors.KDTree. I tested query times up to N = 10000000 and got the same result as you. Query times stay effectively constant regardless of N (the query time of scipy.spatial.KDTree is the same for N = 1000 as for N = 1000000, and likewise for sklearn.neighbors.KDTree and scipy.spatial.cKDTree). That is because the query (search) time complexity is O(logN), and even for N = 1000000, logN is so small that any difference is too small to measure.

  2. The build algorithm of sklearn.neighbors.KDTree (the __init__ method of the class) has time complexity O(KNlogN) (see the scikit-learn Nearest Neighbor Algorithms documentation), so in your case it would be O(2NlogN), which is effectively O(NlogN). Given the very similar build times of sklearn.neighbors.KDTree and scipy.spatial.cKDTree, I assume that the build algorithm of scipy.spatial.cKDTree also has time complexity O(NlogN). I am no expert on nearest neighbor search algorithms, but based on some online searching, I would say that for low-dimensional nearest neighbor search this is as fast as it gets.

    If you go to the nearest neighbor search Wikipedia page, you will see that there are exact methods and approximation methods. The k-d tree is an exact method, a subtype of the space partitioning methods, and of all space partitioning methods (the only fast exact methods listed on that page), the k-d tree is the best one for nearest neighbor search in low-dimensional Euclidean space in a static context (few insertions and deletions). Among the approximation methods, under greedy search in proximity neighborhood graphs, you will see that "Proximity graph methods are considered the current state-of-the-art for the approximate nearest neighbors search." The research article cited for that method (Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs) also has build time complexity O(NlogN), which means that for low-dimensional spaces the k-d tree (an exact method) builds as fast as the approximation methods. All of these structures have search (query) time complexity O(logN), so the best we can get is build complexity O(NlogN) and query complexity O(logN), which is exactly what the k-d tree offers. Based on my research, then, I would say that the k-d tree is the best structure for low-dimensional nearest neighbor searches.

    (I think that if there were a better (faster) method for nearest neighbor search, then scikit-learn and SciPy would have implemented it. From a theoretical standpoint, too, knowing that the fastest comparison-based sorting algorithms have time complexity O(NlogN), it would be quite surprising if a nearest neighbor search structure could be built in less than O(NlogN).)

  3. As I said, you are comparing the Cython implementation, sklearn.neighbors.KDTree, with the pure Python implementation, scipy.spatial.KDTree. In theory sklearn.neighbors.KDTree should be faster, and I compared them up to N = 1000000: they do seem to get closer at large N. For N = 100, scipy.spatial.KDTree is about 10 times slower than sklearn.neighbors.KDTree, while for N = 1000000 it is about twice as slow. I am not sure why this happens, but I suspect that for big N, memory becomes a bigger problem than the number of operations.

    I also checked the re-build time, again up to N = 1000000, and it does increase linearly. That is expected, because the duration of pickle.loads is roughly proportional to the size of the object being loaded; a quick timing sketch of this is included after the list.

  4. For me, pickling of sklearn.neighbors.KDTree, scipy.spatial.KDTree, and scipy.spatial.cKDTree all work, so I can't reproduce your error. My guess is that you have an older version of SciPy, so updating SciPy to the newest version should fix the problem. (A minimal round-trip check is sketched after this list.)

    (If you need more help with this problem, add some more info to your question: your Python and SciPy versions, the exact code that reproduces the error, and the full error message.)
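
Here is the self-contained sketch behind the build-time comparison in point 1. It is only a sketch: the single value of N, the leaf sizes, and the point range are just the ones from the question.

# Build-time comparison of the two Cython implementations
import math, timeit

length = 100000
dim = 5 * int(math.sqrt(length))
nTimes = 50

setup = """import numpy as np
import scipy.spatial, sklearn.neighbors
listOfRandom2DPoints = np.random.randint(0, """ + str(dim) + """, size=(""" + str(length) + """, 2))"""

print(timeit.timeit('scipy.spatial.cKDTree(listOfRandom2DPoints, leafsize=20)', setup=setup, number=nTimes) / nTimes)
print(timeit.timeit('sklearn.neighbors.KDTree(listOfRandom2DPoints, leaf_size=20)', setup=setup, number=nTimes) / nTimes)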
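
For point 3, this is roughly how I checked that reload time grows linearly with N; the N values and number=10 are arbitrary choices.

# Reload time (pickle.loads) versus N
import pickle, timeit
import numpy as np
import sklearn.neighbors

for length in [10000, 100000, 1000000]:
    points = np.random.randint(0, 100, size=(length, 2))
    payload = pickle.dumps(sklearn.neighbors.KDTree(points, leaf_size=20))
    print(length, timeit.timeit(lambda: pickle.loads(payload), number=10) / 10)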
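
And for point 4, the minimal round-trip check I used; with my versions (Python 3.7.3, scikit-learn 0.21.3, SciPy 1.3.0) all three classes pickle without error. The test data is arbitrary.

# Pickle round-trip for all three KD-tree classes
import pickle
import numpy as np
import scipy.spatial, sklearn.neighbors

points = np.random.randint(0, 100, size=(1000, 2))

for tree in (sklearn.neighbors.KDTree(points, leaf_size=20),
             scipy.spatial.KDTree(points, leafsize=20),
             scipy.spatial.cKDTree(points, leafsize=20)):
    restored = pickle.loads(pickle.dumps(tree))   # serialize and reload in memory
    print(type(restored).__name__, 'round-trip OK')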

answered Oct 16 '22 by ands