I have one data table with the following columns:
name, x, y
a, 1, 2
b, 2, 3
c, 3, 1
I want to join this table with itself, keeping every row where name != name and run a distance function on the x and y values from each side. The result should be in the format:
name1, name2, distance
I wrote the distance function like this:
dist <- function(a, b) sqrt((a$x-b$x)^2 + (a$y-b$y)^2)
I tried to use the outer function, but it only takes vectors, not data tables and I tried using the various joins in dplyr but was unsuccessful.
Suppose your dataset looked like:
d <- data_frame(name = rownames(mtcars), x = mtcars$mpg, y = mtcars$cyl)
A general way to try all combinations of two data frames (or all combinations with themselves) is tidyr's crossing function (though you'll need to be careful about renaming the columns). Afterwards you'll be able to compute the distance and do your filtering:
library(dplyr)
library(tidyr)
d %>%
rename(name1 = name, x1 = x, y1 = y) %>%
crossing(d) %>%
rename(name2 = name, x2 = x, y2 = y) %>%
mutate(distance = sqrt((x1 - x2) ^ 2 + (y1 - y2) ^ 2)) %>%
filter(name1 != name2)
In this particular case, you could use my fuzzyjoin package, specifically distance_join (you'll need the latest development version from GitHub). This joins two data frames (in this case, a self-join) based on a distance threshold, and appends an additional column with the distance:
library(fuzzyjoin)
d %>%
rename(name1 = name) %>%
distance_inner_join(d, max_dist = Inf, distance_col = "distance") %>%
rename(name2 = name) %>%
filter(name1 != name2)
This will give:
# A tibble: 992 x 7
name1 x.x y.x name2 x.y y.y distance
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 Mazda RX4 Wag 21.0 6 0.000000
2 Mazda RX4 21 6 Datsun 710 22.8 4 2.690725
3 Mazda RX4 21 6 Hornet 4 Drive 21.4 6 0.400000
4 Mazda RX4 21 6 Hornet Sportabout 18.7 8 3.047950
5 Mazda RX4 21 6 Valiant 18.1 6 2.900000
6 Mazda RX4 21 6 Duster 360 14.3 8 6.992138
7 Mazda RX4 21 6 Merc 240D 24.4 4 3.944617
8 Mazda RX4 21 6 Merc 230 22.8 4 2.690725
9 Mazda RX4 21 6 Merc 280 19.2 6 1.800000
10 Mazda RX4 21 6 Merc 280C 17.8 6 3.200000
# ... with 982 more rows
You could set max_dist to another, non-infinite threshold if you know you don't care about distant matches.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With