How to view the nearest neighbors in R?

Tags: r, kaggle

Let me start by saying I have no experience with R, KNN or data science in general. I recently found Kaggle and have been playing around with the Digit Recognition competition/tutorial.

In this tutorial they provide some sample code to get you started with a basic submission:

# makes the KNN submission

library(FNN)

train <- read.csv("c:/Development/data/digits/train.csv", header=TRUE)
test <- read.csv("c:/Development/data/digits/test.csv", header=TRUE)

labels <- train[,1]
train <- train[,-1]

results <- (0:9)[knn(train, test, labels, k = 10, algorithm="cover_tree")]

write(results, file="knn_benchmark.csv", ncolumns=1)

My questions are:

1. How can I view the nearest neighbors that have been selected for a particular test row?
2. How can I modify which of those ten is selected for my results?

These questions may be too broad. If so, I would welcome any links that could point me down the right road.

It is very possible that I have said something that doesn't make sense here. If this is the case, please correct me.

asked Aug 28 '12 05:08

Abe Miessler

1 Answer

1) You can get the nearest neighbors of a given row like so:

k <- knn(train, test, labels, k = 10, algorithm="cover_tree")
indices <- attr(k, "nn.index")

Then, if you want the indices (row numbers in the training set) of the 10 nearest neighbors of row 20 of the test set:

print(indices[20, ])

(You'll get 10 nearest neighbors because you selected k=10.) For example, if you run with only the first 1000 rows of the training and test sets (to make it computationally easier):

library(FNN)  # the nn.index and nn.dist attributes come from FNN's knn()

train <- read.csv("train.csv", header=TRUE)[1:1000, ]
test <- read.csv("test.csv", header=TRUE)[1:1000, ]

labels <- train[,1]
train <- train[,-1]

k <- knn(train, test, labels, k = 10, algorithm="cover_tree")
indices <- attr(k, "nn.index")

print(indices[20, ])
# output:
#  [1] 829 539 784 487 293 882 367 268 201 277

Those are the indices within the training set of 1000 that are closest to the 20th row of the test set.
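
To make concrete what each row of the index matrix holds, here is a minimal sketch in base R that finds neighbors by brute force on made-up toy data. It is not how FNN does it internally (the call above uses a cover tree), and `train_toy`, `test_toy`, and `nearest_rows` are hypothetical names used only for illustration:

```r
# Toy data (made up for illustration): 5 training points, 2 test points in 2-D.
train_toy <- matrix(c(0, 0,
                      1, 0,
                      0, 1,
                      5, 5,
                      6, 5), ncol = 2, byrow = TRUE)
test_toy  <- matrix(c(0.2, 0.1,
                      5.4, 5.0), ncol = 2, byrow = TRUE)
k_toy <- 3

# For one test row: Euclidean distance to every training row,
# then the positions of the k smallest distances (nearest first).
nearest_rows <- function(test_row, train_mat, k) {
  d <- sqrt(rowSums(sweep(train_mat, 2, test_row)^2))
  order(d)[seq_len(k)]
}

# One row per test point, one column per neighbor.
toy_indices <- t(apply(test_toy, 1, nearest_rows,
                       train_mat = train_toy, k = k_toy))
print(toy_indices)
# The first test point's neighbors are training rows 1, 2, 3;
# the second test point's neighbors are training rows 4, 5, 2.
```

So `toy_indices[2, ]` plays the same role as `indices[20, ]` above: row numbers of the training set, ordered from nearest to farthest.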

2) It depends what you mean by "modify". For starters, you can get the labels of the 10 closest training points for each test row like this:

closest.labels = apply(indices, 2, function(col) labels[col])

You can then see the labels of the 10 closest points to the 20th test point like this:

closest.labels[20, ]
# [1] 0 0 0 0 0 0 0 0 0 0

This indicates that all 10 of the closest points to test row 20 are in the group labeled 0. knn simply chooses the label by majority vote (with ties broken at random), but you could apply your own rule to closest.labels, such as a weighting scheme, if you prefer.
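
As one possible way to apply your own selection rule, the sketch below does a plain majority vote per row but breaks ties in favor of the label that appears earliest among the neighbors, instead of at random. The matrix `labels_toy` is a made-up stand-in for a matrix like closest.labels, and `vote` is a hypothetical helper, not part of FNN:

```r
# Made-up stand-in for a neighbor-label matrix: one row per test point,
# one column per neighbor (nearest first).
labels_toy <- rbind(c(3, 3, 8, 8, 3),
                    c(1, 7, 7, 1, 4))

# Majority vote; if several labels tie, pick the tied label that
# occurs first (i.e. belongs to the nearest neighbor among them).
vote <- function(neighbor_labels) {
  counts <- table(neighbor_labels)
  tied <- as.numeric(names(counts)[counts == max(counts)])
  neighbor_labels[neighbor_labels %in% tied][1]
}

custom_results <- apply(labels_toy, 1, vote)
print(custom_results)
# Row 1: label 3 wins 3 votes to 2.
# Row 2: labels 1 and 7 tie at 2 votes each; 1 comes first, so 1 is chosen.
```

With the real data you would call `apply(closest.labels, 1, vote)` and write that vector out instead of the result of knn.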

ETA: If you're interested in weighting the closer elements more heavily in your voting scheme, note that you can also get the distances to each of the k neighbors like this:

dists = attr(k, "nn.dist")
dists[20, ]
# output:
# [1] 1238.777 1243.581 1323.538 1398.060 1503.371 1529.660 1538.128 1609.730
# [9] 1630.910 1667.014
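
Here is a rough sketch of one such weighting scheme, inverse-distance voting, in base R. The toy matrices `labels_toy` and `dists_toy` are made up and stand in for a neighbor-label matrix and a neighbor-distance matrix; `weighted_vote` and the small constant `eps` (which guards against division by zero) are assumptions for illustration, not part of FNN:

```r
# Made-up stand-ins: labels and distances of 3 neighbors for 2 test points.
labels_toy <- rbind(c(3, 8, 8),
                    c(5, 5, 2))
dists_toy  <- rbind(c(1, 4, 4),
                    c(1, 2, 5))

# Each neighbor votes with weight 1 / distance; the label with the
# largest total weight wins.
weighted_vote <- function(neighbor_labels, neighbor_dists, eps = 1e-8) {
  w <- 1 / (neighbor_dists + eps)            # closer neighbors weigh more
  totals <- tapply(w, neighbor_labels, sum)  # total weight per label
  as.numeric(names(totals)[which.max(totals)])
}

weighted_results <- sapply(seq_len(nrow(labels_toy)), function(i)
  weighted_vote(labels_toy[i, ], dists_toy[i, ]))
print(weighted_results)
# Row 1: label 3 (weight ~1) beats label 8 (weight ~0.5), even though
#        8 has more votes.
# Row 2: label 5 (weight ~1.5) beats label 2 (weight ~0.2).
```

Inverse distance is only one choice; any decreasing function of distance could be swapped in for the line that computes `w`.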

answered Oct 08 '22 20:10

David Robinson
