I have two large sparse matrices:
In [3]: trainX
Out[3]:
<6034195x755258 sparse matrix of type '<type 'numpy.float64'>'
with 286674296 stored elements in Compressed Sparse Row format>
In [4]: testX
Out[4]:
<2013337x755258 sparse matrix of type '<type 'numpy.float64'>'
with 95423596 stored elements in Compressed Sparse Row format>
About 5 GB RAM in total to load. Note these matrices are HIGHLY sparse (0.0062% occupied).
For each row in testX
, I want to find the Nearest Neighbor in trainX
and return its corresponding label, found in trainY
. trainY
is a list with the same length as trainX
and has many many classes. (A class is made up of 1-5 separate labels, each label is one of 20,000, but the number of classes is not relevant to what I am trying to do right now.)
I am using sklearn
's KNN algorithm to do this:
from sklearn import neighbors
clf = neighbors.KNeighborsClassifier(n_neighbors=1)
clf.fit(trainX, trainY)
clf.predict(testX[0])
Even predicting for 1 item of testX
takes a while (i.e. something like 30-60 secs, but if you multiply by 2 million, it becomes pretty much impossible). My laptop with 16GB of RAM starts to swap a bit, but does manage to complete for 1 item in testX
.
My questions is, how can I do this so it will finish in reasonable time? Say one night on a large EC2 instance? Would just having more RAM and preventing the swapping speed it up enough (my guess is no). Maybe I can somehow make use of the sparsity to speed up the calculation?
Thank you.
Classic kNN data structures such as the KD tree used in sklearn
become very slow when the dimension of the data increases. For very high-dimensional problems it is advisable to switch algorithm class and use approximate nearest neighbour (ANN) methods, which sklearn
seems to be lacking, unfortunately. See links below for papers on algorithms and theory why approximate nearest neighbors is so much faster in these cases.
A well-known ANN library in the C++ world, widely used in Computer Vision for nearest neighbors in feature descriptor spaces, is FLANN
. The homepage says it contains Python bindings (I have never worked with then).
Another popular alternative is the ANN
library with Python wrapper here, although the newer FLANN seems to be more popular at the moment.
See also this answer (but some links are dead).
One caveat: Your data seems to be very high dimensional - I don't known how these libraries perform for you. They should still beat sklearn
.
Even predicting for 1 item of testX takes a while (i.e. something like 30-60 secs, but if you multiply by 2 million, it becomes pretty much impossible).
That's exactly why all scikit-learn estimators take batches of samples in their predict
method. If you pass multiple samples in one call, the cost of input validation and Python's slow loops gets smaller, so the time per samples becomes a lot less than the cost of one sample times the number of samples.
>>> from sklearn.datasets import fetch_20newsgroups_vectorized
>>> from sklearn.decomposition import TruncatedSVD
>>> from sklearn.neighbors import KNeighborsClassifier
>>> data = fetch_20newsgroups_vectorized()
>>> X, y = data['data'], data['target']
>>> X = TruncatedSVD(n_components=100).fit_transform(X)
>>> clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
>>> %timeit clf.predict(X[0])
1000 loops, best of 3: 766 us per loop
>>> %timeit clf.predict(X[0:10])
100 loops, best of 3: 2.44 ms per loop
>>> %timeit clf.predict(X[0:100])
100 loops, best of 3: 14.2 ms per loop
>>> %timeit clf.predict(X[0:1000])
10 loops, best of 3: 117 ms per loop
Maybe I can somehow make use of the sparsity to speed up the calculation?
You can sample from the training set instead of using it all. k-NN performance depends on the size of the training set, which is why the vanilla k-NN algorithm isn't a very good choice for text classification.
(A favorite trick in the text processing field is to use an on-disk index to build a k-NN classifier, e.g. Lucene: use an entire document as a query, retrieve the top k documents, determine the label from those.)
As far as I know, neither FLANN nor ANN handle sparse data very well. I've just released a new C++ library for K-NN search for generic data type and generic similarity measure at www.kgraph.org. All you have to do is to plug in your function of computing a similarity between object i and object j and the library will do the rest of the magic. The downside is that you probably won't be able to gain much by using python. As the similarity computing code will be invoked extremely frequently, it doesn't make much sense to add a python API for user-provided metric.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With