EDIT: From my discussion below with @joran, I think I have figured out how dist is altering the distance value: it appears to be scaling the sum of the squared differences over the non-missing coordinates by [total dimensions]/[non-missing dimensions], but that is just a guess. What I'd like to know, if anyone does know, is: is that what is really going on? If so, why is that considered a reasonable thing to do? And could, or should, dist have an option to compute the distance the way I proposed (that last question may be too vague or too opinion-based to answer, though)?
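If that guess is right, the sqrt(6) result from the v1/v3 example below can be reproduced by hand like this (just a check of the guess, not a claim about how dist works internally):

v1 <- c(1, 1, 1)
v3 <- c(1, NA, 3)
ss <- sum((v1 - v3)^2, na.rm = TRUE)          # squared differences over non-missing dims: 4
sqrt(ss * length(v1) / sum(!is.na(v1 - v3)))  # scale by 3/2, giving sqrt(6)
# [1] 2.44949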
I was wondering how the dist function actually works on vectors that have missing values. Below is a recreated example. I use the dist function and a more fundamental implementation of what I believe should be the definition of Euclidean distance, built from sqrt, sum, and powers. I also expected that if a component of either vector was NA, that dimension would simply be dropped from the sum, which is how I implemented it. But you can see below that this definition doesn't agree with dist.
I will be using my basic implementation to handle the NA values (a generalized sketch of it appears after the example code below), but I am wondering how dist actually arrives at a value when the vectors contain NA, and why it doesn't agree with how I calculate it. I would have thought that my basic implementation would be the default/common one, and I can't figure out what alternate method dist is using to get the result it reports.
Thanks, Matt
v1 <- c(1,1,1)
v2 <- c(1,2,3)
v3 <- c(1,NA,3)
# Agree on vectors with non-missing components
# --------------------------------------------
dist(rbind(v1, v2))
# v1
# v2 2.236068
sqrt(sum((v1 - v2)^2, na.rm=TRUE))
# [1] 2.236068
# But they don't agree when there is a missing component
# Under what logic does sqrt(6) make sense as the answer for dist?
# --------------------------------------------
dist(rbind(v1, v3))
# v1
# v3 2.44949
sqrt(sum((v1 - v3)^2, na.rm=TRUE))
# [1] 2
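For reference, here is a slightly more general sketch of the implementation I plan to use, applied pairwise to the rows of a matrix (dist_na_omit is just a name I made up; it is not part of base R and not how dist computes its result):

dist_na_omit <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      diffs <- m[i, ] - m[j, ]   # NA wherever either row is NA
      d[i, j] <- d[j, i] <- sqrt(sum(diffs^2, na.rm = TRUE))
    }
  }
  as.dist(d)
}

dist_na_omit(rbind(v1, v3))
# v1
# v3 2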
Yes, the scaling happens exactly like you described. Maybe this is a better example:
set.seed(123)
v1 <- sample(c(1:3, NA), 100, TRUE)
v2 <- sample(c(1:3, NA), 100, TRUE)
dist(rbind(v1, v2))
# v1
# v2 12.24745
na.idx <- is.na(v1) | is.na(v2)
v1a <- v1[!na.idx]
v2a <- v2[!na.idx]
sqrt(sum((v1a - v2a)^2) * length(v1) / length(v1a))
# [1] 12.24745
The scaling makes sense to me. All things being equal, the distance increases as the number of dimensions increases. If you have an NA for dimension i, a reasonable guess for the contribution of dimension i to the sum of squares is the mean contribution of all the other dimensions. Hence the linear up-scaling.
You, on the other hand, are suggesting that when you find an NA for dimension i, that dimension should not contribute to the sum of squares at all. That is like assuming that v1[i] == v2[i], which is something quite different.
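The corresponding illustration of your rule: dropping the missing dimension gives the same number as pretending the two vectors agree there.

a <- c(1, 1, 1)
b <- c(1, NA, 3)
b[is.na(b)] <- a[is.na(b)]   # assume the vectors agree in the missing dimension
sqrt(sum((a - b)^2))         # same as simply dropping dimension 2
# [1] 2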
To summarize: dist is doing some type of maximum-likelihood estimation, while your suggestion is more like a worst-case (or best-case) scenario.