EDIT: From my discussion below with @joran, I think I have figured out how dist is altering the distance value: it appears to be scaling the sum of the squared differences over the non-missing coordinates by [total dimensions]/[non-missing dimensions], but that is just a guess. What I'd like to know, if anyone does know, is: is that what is really going on? If so, why is that considered a reasonable thing to do? And could, or should, dist have an option to compute the distance the way I proposed (that last question may be too vague or too opinion-based to answer, though)?
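If that guess is right, the sqrt(6) result from the v1/v3 example below can be reproduced by hand like this (just a check of the guess, not a claim about how dist works internally):

v1 <- c(1, 1, 1)
v3 <- c(1, NA, 3)
ss <- sum((v1 - v3)^2, na.rm = TRUE)          # squared differences over non-missing dims: 4
sqrt(ss * length(v1) / sum(!is.na(v1 - v3)))  # scale by 3/2, giving sqrt(6)
# [1] 2.44949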
I was wondering how the dist function actually works on vectors that have missing values. Below is a recreated example. I use the dist function and a more fundamental implementation of what I believe should be the definition of Euclidean distance, built from sqrt, sum, and powers. I also expected that if a component of either vector was NA, that dimension would simply be dropped from the sum, which is how I implemented it. But you can see below that this definition doesn't agree with dist.
I will be using my basic implementation to handle the NA values (a generalized sketch of it appears after the example code below), but I am wondering how dist actually arrives at a value when the vectors contain NA, and why it doesn't agree with how I calculate it. I would have thought that my basic implementation would be the default/common one, and I can't figure out what alternate method dist is using to get the result it reports.
Thanks, Matt
v1 <- c(1,1,1)
v2 <- c(1,2,3)
v3 <- c(1,NA,3)
# Agree on vectors with non-missing components
# --------------------------------------------
dist(rbind(v1, v2))
# v1
# v2 2.236068
sqrt(sum((v1 - v2)^2, na.rm=TRUE))
# [1] 2.236068
# But they don't agree when there is a missing component
# Under what logic does sqrt(6) make sense as the answer for dist?
# --------------------------------------------
dist(rbind(v1, v3))
# v1
# v3 2.44949
sqrt(sum((v1 - v3)^2, na.rm=TRUE))
# [1] 2
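For reference, here is a slightly more general sketch of the implementation I plan to use, applied pairwise to the rows of a matrix (dist_na_omit is just a name I made up; it is not part of base R and not how dist computes its result):

dist_na_omit <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      diffs <- m[i, ] - m[j, ]   # NA wherever either row is NA
      d[i, j] <- d[j, i] <- sqrt(sum(diffs^2, na.rm = TRUE))
    }
  }
  as.dist(d)
}

dist_na_omit(rbind(v1, v3))
# v1
# v3 2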
Yes, the scaling happens exactly like you described. Maybe this is a better example:
set.seed(123)
v1 <- sample(c(1:3, NA), 100, TRUE)
v2 <- sample(c(1:3, NA), 100, TRUE)
dist(rbind(v1, v2))
# v1
# v2 12.24745
na.idx <- is.na(v1) | is.na(v2)
v1a <- v1[!na.idx]
v2a <- v2[!na.idx]
sqrt(sum((v1a - v2a)^2) * length(v1) / length(v1a))
# [1] 12.24745
The scaling makes sense to me. All things being equal, the distance increases as the number of dimensions increases. If you have an NA for dimension i, a reasonable guess for the contribution of dimension i to the sum of squares is the mean contribution of all the other dimensions. Hence the linear up-scaling.
You, on the other hand, are suggesting that when you find an NA for dimension i, that dimension should not contribute to the sum of squares at all. That is like assuming that v1[i] == v2[i], which is something quite different.
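The corresponding illustration of your rule: dropping the missing dimension gives the same number as pretending the two vectors agree there.

a <- c(1, 1, 1)
b <- c(1, NA, 3)
b[is.na(b)] <- a[is.na(b)]   # assume the vectors agree in the missing dimension
sqrt(sum((a - b)^2))         # same as simply dropping dimension 2
# [1] 2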
To summarize: dist is doing some type of maximum-likelihood estimation, while your suggestion is more like a worst-case (or best-case) scenario.