
R knn large dataset

Tags: r, knn

I'm trying to use knn in R (I've tried several packages: knnflex, class) to predict the probability of default based on 8 variables. The dataset is about 100k rows of 8 columns, but my machine seems to be having difficulty with even a 10k-row sample. Any suggestions for doing knn on a dataset of more than ~50 rows (i.e., bigger than iris)?

EDIT:

To clarify, there are a couple of issues.

1) The examples in the class and knnflex packages are a bit unclear, and I was curious whether there is an implementation similar to the randomForest package, where you give it the variable you want to predict and the data you want to use to train the model:

RF <- randomForest(x, y, ntree, type,...) 

and then use the model to make predictions on the test data set:

pred <- predict(RF, testData)
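For concreteness, here is a runnable sketch of that pattern (the data here is made-up stand-in data, not my actual default dataset; names and sizes are arbitrary):

library(randomForest)

# Made-up stand-in data (the real dataset has 8 predictors and a default flag)
set.seed(1)
trainData <- data.frame(matrix(rnorm(1000 * 8), ncol = 8))
trainData$default <- factor(sample(c("yes", "no"), 1000, replace = TRUE))
testData <- data.frame(matrix(rnorm(200 * 8), ncol = 8))

# Fit once, then predict separately -- the two-step interface I'm after
RF <- randomForest(default ~ ., data = trainData, ntree = 100)
pred <- predict(RF, testData)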

2) I don't really understand why knn wants training AND test data for building the model. From what I can tell, the package creates a matrix of size roughly nrow(trainingData)^2 (some rough numbers below), which also seems to put an upper limit on the size of the data that can be predicted. I built a model using 5000 rows (above that number I got memory allocation errors) and was unable to predict test sets of more than 5000 rows. So I would need to either:

a) find a way to use more than 5000 rows in the training set,

or

b) find a way to apply the model to the full 100k rows.
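For a sense of scale, assuming a dense double-precision distance matrix (which is what the knnflex approach seems to build):

# Bytes for an n x n matrix of doubles: n^2 * 8
5000^2 * 8 / 1024^2    # ~191 MiB -- feasible, but intermediate copies add up
100000^2 * 8 / 1024^3  # ~74.5 GiB -- far beyond desktop RAM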

asked Nov 21 '11 by screechOwl


1 Answer

The reason that knn (in class) asks for both the training and test data is that if it didn't, the "model" it would return would simply be the training data itself.

The training data is the model.

To make predictions, knn calculates the distance between a test observation and each training observation (although I suppose there are some fancy versions for insanely large data sets that don't check every distance). So until you have test observations, there isn't really a model to build.
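To illustrate, here's a minimal call to knn from the class package with stand-in data (names and sizes are arbitrary). Training and test data go in together, and prob = TRUE returns the winning class's share of neighbour votes, which is about as close as this interface gets to a default probability:

library(class)

# Stand-in data: 8 numeric predictors, binary outcome
set.seed(1)
train <- matrix(rnorm(1000 * 8), ncol = 8)
test  <- matrix(rnorm(200 * 8), ncol = 8)
cl    <- factor(sample(c("default", "ok"), 1000, replace = TRUE))

# One call does everything; there is no separate fitting step
pred <- knn(train, test, cl, k = 15, prob = TRUE)
head(attr(pred, "prob"))  # winning class's share of the k neighbour votes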

The ipred package provides functions that appear structured as you describe, but if you look at them, you'll see that there is basically nothing happening in the "training" function; all the work is in the "predict" function. And those are really intended as wrappers for estimating error via cross-validation.
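To see that concretely, here's a toy wrapper (a sketch, not ipred's actual API) where the "fit" step just stores the data and all the work happens at prediction time:

# "Training" a knn model is just remembering the data
knn_fit <- function(x, y, k = 5) {
  structure(list(x = x, y = y, k = k), class = "lazyknn")
}

# All of the real work: distances from each new row to every stored row
predict.lazyknn <- function(object, newdata, ...) {
  class::knn(object$x, newdata, object$y, k = object$k)
}

# Usage, with the stand-in train/cl/test from above
pred <- predict(knn_fit(train, cl, k = 15), test)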

As far as limits on the number of cases go, that depends on how much physical memory you have. If you're getting memory allocation errors, you either need to reduce your RAM usage elsewhere (close other applications, etc.), buy more RAM, buy a new computer, etc.

The knn function in class runs fine for me with training and test data sets of 10k rows or more, though I do have 8 GB of RAM. I also suspect that knn in class will be faster than knnflex, but I haven't done extensive testing.
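Since there's no fitted object to reuse, one memory-friendly way to score your full 100k rows is to loop over the test set in chunks, passing the whole training set each time. A sketch (chunk size is arbitrary; train and cl are assumed defined as above):

library(class)

# Score a large test set in pieces to cap peak memory use
predict_in_chunks <- function(train, cl, bigTest, k = 15, chunk = 5000) {
  groups <- ceiling(seq_len(nrow(bigTest)) / chunk)
  preds <- lapply(split(seq_len(nrow(bigTest)), groups), function(i)
    knn(train, bigTest[i, , drop = FALSE], cl, k = k))
  # Stitch the per-chunk factors back into one factor, preserving row order
  factor(unlist(lapply(preds, as.character)), levels = levels(cl))
}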

answered Oct 01 '22 by joran