Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R data table outer join by function

Tags:

r

I have one data table with the following columns:

name, x, y
a, 1, 2
b, 2, 3
c, 3, 1

I want to join this table with itself, keeping every row where name != name and run a distance function on the x and y values from each side. The result should be in the format:

name1, name2, distance

I wrote the distance function like this:

dist <- function(a, b) sqrt((a$x-b$x)^2 + (a$y-b$y)^2)

I tried to use the outer function, but it only takes vectors, not data tables and I tried using the various joins in dplyr but was unsuccessful.

like image 436
Nick Larsen Avatar asked Feb 16 '26 14:02

Nick Larsen


1 Answers

Suppose your dataset looked like:

d <- data_frame(name = rownames(mtcars), x = mtcars$mpg, y = mtcars$cyl)

A general way to try all combinations of two data frames (or all combinations with themselves) is tidyr's crossing function (though you'll need to be careful about renaming the columns). Afterwards you'll be able to compute the distance and do your filtering:

library(dplyr)
library(tidyr)

d %>%
  rename(name1 = name, x1 = x, y1 = y) %>%
  crossing(d) %>%
  rename(name2 = name, x2 = x, y2 = y) %>%
  mutate(distance = sqrt((x1 - x2) ^ 2 + (y1 - y2) ^ 2)) %>%
  filter(name1 != name2)

In this particular case, you could use my fuzzyjoin package, specifically distance_join (you'll need the latest development version from GitHub). This joins two data frames (in this case, a self-join) based on a distance threshold, and appends an additional column with the distance:

library(fuzzyjoin)

d %>%
  rename(name1 = name) %>%
  distance_inner_join(d, max_dist = Inf, distance_col = "distance") %>%
  rename(name2 = name) %>%
  filter(name1 != name2)

This will give:

# A tibble: 992 x 7
       name1   x.x   y.x             name2   x.y   y.y distance
       <chr> <dbl> <dbl>             <chr> <dbl> <dbl>    <dbl>
1  Mazda RX4    21     6     Mazda RX4 Wag  21.0     6 0.000000
2  Mazda RX4    21     6        Datsun 710  22.8     4 2.690725
3  Mazda RX4    21     6    Hornet 4 Drive  21.4     6 0.400000
4  Mazda RX4    21     6 Hornet Sportabout  18.7     8 3.047950
5  Mazda RX4    21     6           Valiant  18.1     6 2.900000
6  Mazda RX4    21     6        Duster 360  14.3     8 6.992138
7  Mazda RX4    21     6         Merc 240D  24.4     4 3.944617
8  Mazda RX4    21     6          Merc 230  22.8     4 2.690725
9  Mazda RX4    21     6          Merc 280  19.2     6 1.800000
10 Mazda RX4    21     6         Merc 280C  17.8     6 3.200000
# ... with 982 more rows

You could set max_dist to another, non-infinite threshold if you know you don't care about distant matches.

like image 155
David Robinson Avatar answered Feb 19 '26 05:02

David Robinson



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!