using k-NN in R with categorical values

Question

I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other numerical assuming distance) doesn't fit.

I'm looking for a kNN implementation for [R] where it is possible to select different distance methods, like Hamming distance. Is there a way to use common kNN implementations like the one in {class} with different distance metric functions?

I'm using R 2.15

Backlin · Accepted Answer

As long as you can calculate a distance/dissimilarity matrix (in whatever way you like) you can easily perform kNN classification without the need of any special package.

# Generate dummy data
y <- rep(1:2, each=50)                          # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5  # Dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)

# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
NN <- apply(d[test.set, design.set], 1, order)

# Predict class membership of the test set
k <- 5
pred <- apply(NN[, 1:k, drop=FALSE], 1, function(nn){
    tab <- table(y[design.set][nn])
    as.integer(names(tab)[which.max(tab)])      # This is a pretty dirty line
}

# Inspect the results
table(pred, y[test.set])

If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.

The drop=FALSE argument is needed to preserve the subset of NN as matrix in the case k=1. If not it will be converted to a vector and apply will throw an error.

using k-NN in R with categorical values

Tags:

r

distance

knn

Omri374

1 Answers

Backlin

Recent Activity

Donate For Us

using k-NN in R with categorical values

Tags:

r

distance

knn

Omri374

1 Answers

Backlin

Related questions

Recent Activity

Donate For Us