Error with knn function

Tags: r, knn

I am trying to run this line:

knn(mydades.training[,-7],mydades.test[,-7],mydades.training[,7],k=5)

but I always get this error:

Error in knn(mydades.training[, -7], mydades.test[, -7], mydades.training[,  : 
  NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[,  :
  NAs introduced by coercion
2: In knn(mydades.training[, -7], mydades.test[, -7], mydades.training[,  :
  NAs introduced by coercion

Any ideas, please?

PS: mydades.training and mydades.test are defined as follows:

N <- nrow(mydades) 
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]
user2443456 asked Jun 01 '13 15:06


1 Answer

I suspect that your issue lies in having non-numeric data fields in 'mydades'. The error line:

NA/NaN/Inf in foreign function call (arg 6)

makes me suspect that the call from knn into its C implementation fails. Many R functions call underlying, more efficient C implementations rather than implementing the algorithm purely in R. If you type just 'knn' in your R console (with the 'class' package loaded), you can inspect the R implementation of 'knn'. It contains the following line:

 Z <- .C(VR_knn, as.integer(k), as.integer(l), as.integer(ntr), 
        as.integer(nte), as.integer(p), as.double(train), as.integer(unclass(clf)), 
        as.double(test), res = integer(nte), pr = double(nte), 
        integer(nc + 1), as.integer(nc), as.integer(FALSE), as.integer(use.all))

where .C means that we're calling a C function named 'VR_knn' with the provided function arguments. Since you get two of the warnings

NAs introduced by coercion

I think two of the as.double/as.integer calls fail to convert some values and introduce NA values. If we count the arguments passed after the function name 'VR_knn', the 6th argument is:

as.double(train)

which can produce NA values in cases such as:

# as.double cannot convert arbitrary text to a double; such values are coerced to NA with a warning:
> as.double("sometext")
[1] NA
Warning message:
NAs introduced by coercion
# while text that represents a number is converted to double without any warning:
> as.double("1.23")
[1] 1.23
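
To see how a single text column can affect the whole input, here is a minimal sketch in base R. It assumes the data is a data frame with one non-numeric column; the names 'df', 'm' and 'converted' are made up for illustration. As far as I can tell, 'knn' passes its inputs through as.matrix before the .C call, and as.matrix on a mixed data frame gives a character matrix:

```r
# Hypothetical data frame: two numeric columns and one text column
df <- data.frame(x = c(1.5, 2.5, 3.5),
                 y = c(10, 20, 30),
                 z = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# One non-numeric column turns the whole matrix into character
m <- as.matrix(df)
is.character(m)

# as.double() then yields NA for every entry that is not a number
# (suppressWarnings hides the "NAs introduced by coercion" warning)
converted <- suppressWarnings(as.double(m))
sum(is.na(converted))   # counts the entries that came from column z
```

The numeric columns survive the round trip through text, so only the entries of the text column end up as NA.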

You get two of the coercion warnings, which are probably raised by 'as.double(train)' and 'as.double(test)'. Since you did not tell us exactly what 'mydades' looks like, here are some of my best guesses (using artificial data drawn from a multivariate normal distribution):

library(MASS)
mydades <- mvrnorm(100, mu=c(1:6), Sigma=matrix(1:36, ncol=6))
mydades <- cbind(mydades, sample(LETTERS[1:5], 100, replace=TRUE))

# This breaks knn (Inf reaches the foreign function call)
mydades[3,4] <- Inf
# This breaks knn as well
mydades[4,3] <- -Inf
# Neither of these, however, triggers the "NAs introduced by coercion" warning

# This breaks knn and gives the same error; just some raw text
mydades[1,2] <- mydades[50,1] <- "foo"
mydades[100,3] <- "bar"

# ... or perhaps wrongly formatted exponential numbers?
mydades[1,1] <- "2.34EXP-05"

# ... or wrong decimal symbol?
mydades[3,3] <- "1,23" 
# should be 1.23, as R uses '.' as decimal symbol and not ','

# ... or most likely a whole column is non-numeric, since the error is given twice (as.double problem both in training AND test set)
mydades[,1] <- sample(letters[1:5],100,replace=TRUE)

I would not keep both the numeric data and the class labels in a single matrix, because a matrix can hold only one data type. Perhaps you could split the data as:

mydadesnumeric <- mydades[,1:6] # 6 first columns
mydadesclasses <- mydades[,7]

Calls such as

str(mydades); summary(mydades)

may also help you/us locate the problematic data entries, so that you can either correct them to numeric values or omit the non-numeric fields.
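
If str and summary do not make the culprits obvious, the bad cells can be located directly. This is only a sketch on a toy character matrix; 'features', 'as_num' and 'bad' are hypothetical names, and the bad cells are planted on purpose. For a data frame you would probably run as.matrix on the feature columns first:

```r
# Toy character matrix standing in for the feature columns
features <- matrix(c("1.2", "foo", "3.4",
                     "1,23", "5.6", "Inf"),
                   nrow = 3)

# Convert every cell; anything that cannot be parsed becomes NA
as_num <- suppressWarnings(as.numeric(features))
dim(as_num) <- dim(features)   # as.numeric drops the matrix shape

# Row/column positions of entries that are NA, NaN or infinite
bad <- which(!is.finite(as_num), arr.ind = TRUE)
bad

# The offending original values
features[bad]
```

Here is.finite is used rather than is.na because the C call rejects Inf and -Inf as well as NA.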

The rest of the run code (after breaking the data), as provided by you:

N <- nrow(mydades) 
permut <- sample(c(1:N),N,replace=FALSE)
ord <- order(permut)
mydades.shuffled <- mydades[ord,]
prop.train <- 1/3
NOMBRE <- round(prop.train*N)
mydades.training <- mydades.shuffled[1:NOMBRE,]
mydades.test <- mydades.shuffled[(NOMBRE+1):N,]

# knn comes from the 'class' package
library(class)
# 7th column seems to be the class labels
knn(train=mydades.training[,-7],test=mydades.test[,-7],cl=mydades.training[,7],k=5)
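
For comparison, here is a sketch of a run that should go through cleanly when the features are purely numeric and the labels are kept separately. The data is artificial, and the identity covariance, the seed and the names 'features', 'classes', 'idx', 'n_train', 'train_idx', 'test_idx' and 'pred' are all assumptions made for illustration, not part of the question:

```r
library(MASS)   # mvrnorm
library(class)  # knn

set.seed(1)  # arbitrary seed, only for reproducibility

# Artificial data: numeric features and class labels in separate objects
features <- mvrnorm(100, mu = 1:6, Sigma = diag(6))
classes <- factor(sample(LETTERS[1:5], 100, replace = TRUE))

# Same 1/3 training proportion as in the question
N <- nrow(features)
idx <- sample(N)
n_train <- round(N / 3)
train_idx <- idx[1:n_train]
test_idx <- idx[(n_train + 1):N]

# Check that nothing can become NA/NaN/Inf in the C call
stopifnot(is.numeric(features), all(is.finite(features)))

pred <- knn(train = features[train_idx, ],
            test  = features[test_idx, ],
            cl    = classes[train_idx],
            k     = 5)
length(pred)   # one predicted label per test row
```

Since the labels here are random, the predictions are meaningless; the point is only that the call completes without the NA/NaN/Inf error.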
Teemu Daniel Laajala answered Oct 06 '22 08:10