
Is there a clean dplyr-way of doing multiple left-(self)joins?

Tags:

r

dplyr

I have the following, working, code:

library(dplyr)
library(tidyr)

test_hierarchie <- tribble(~child, ~parent,
                "A", "B",
                "B", "C",
                "D", "E"
                )

test_hierarchie_transformed <- test_hierarchie %>%
    left_join(test_hierarchie, by = c("parent" = "child"), suffix = c("", "_grant")) %>%
    left_join(test_hierarchie, by = c("parent_grant" = "child"), suffix = c("", "_grant")) %>%
    left_join(test_hierarchie, by = c("parent_grant_grant" = "child"), suffix = c("", "_grant")) %>%
    left_join(test_hierarchie, by = c("parent_grant_grant_grant" = "child"), suffix = c("", "_grant")) %>%
    left_join(test_hierarchie, by = c("parent_grant_grant_grant_grant" = "child"), suffix = c("", "_grant")) %>%
    pivot_longer(names_to = "relation", cols = contains("parent"), values_to = "parent") %>%
    filter(!is.na(parent))

With result:

# A tibble: 4 x 3
  child relation     parent
  <chr> <chr>        <chr> 
1 A     parent       B     
2 A     parent_grant C     
3 B     parent       C     
4 D     parent       E 

This is the desired result. The large number of left_joins is there because, for the real data, I'm not sure what the maximum depth of the hierarchy is.

My question is: is there a way to do this more succinctly and dynamically? Thanks!

EDIT 1: Yes, I do mean 'grand' instead of 'grant', haha.

EDIT 2: Great solution, exactly what I was looking for! Thanks everyone for pitching in. The other day I was thinking about another project, and igraph does seem very helpful for that.

CorneeldH asked Nov 24 '21 08:11



2 Answers

Following the suggestion by @zx8754, one option to achieve your desired result is to do the left_joins via a recursive function that stops when there are no more matches:

library(dplyr)
library(tidyr)

test_hierarchie <- tribble(
  ~child, ~parent,
  "A", "B",
  "B", "C",
  "D", "E"
)

left_join_recursive <- function(x, by) {
  x <- left_join(x, test_hierarchie, by = setNames("child", by), suffix = c("", "_grant"))
  byby <- paste0(by, "_grant")
  if (!all(is.na(x[[byby]]))) {
    left_join_recursive(x, byby)  
  } else {
    x
  }
}

test_hierarchie_transformed <- left_join_recursive(test_hierarchie, "parent") %>%
  pivot_longer(names_to = "relation", cols = contains("parent"), values_to = "parent") %>%
  filter(!is.na(parent))

test_hierarchie_transformed
#> # A tibble: 4 × 3
#>   child relation     parent
#>   <chr> <chr>        <chr> 
#> 1 A     parent       B     
#> 2 A     parent_grant C     
#> 3 B     parent       C     
#> 4 D     parent       E

To check whether the approach works in a more general case, I added another row to your example data:

test_hierarchie <- add_row(test_hierarchie, child = "C", parent = "D")

test_hierarchie_transformed <- left_join_recursive(test_hierarchie, "parent") %>%
  pivot_longer(names_to = "relation", cols = contains("parent"), values_to = "parent") %>%
  filter(!is.na(parent))

test_hierarchie_transformed
#> # A tibble: 10 × 3
#>    child relation                 parent
#>    <chr> <chr>                    <chr> 
#>  1 A     parent                   B     
#>  2 A     parent_grant             C     
#>  3 A     parent_grant_grant       D     
#>  4 A     parent_grant_grant_grant E     
#>  5 B     parent                   C     
#>  6 B     parent_grant             D     
#>  7 B     parent_grant_grant       E     
#>  8 D     parent                   E     
#>  9 C     parent                   D     
#> 10 C     parent_grant             E
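
The recursive function above reads test_hierarchie from the global environment and would recurse without end if the data contained a cycle (for example A -> B -> A). As a rough sketch of one way around both points, the same walk can be written as a loop that takes the hierarchy as an argument and builds the long format directly. The names ancestors_long, max_depth, lookup, node and next_parent are made up for this illustration and are not part of the answer; the sketch also assumes a non-empty table in which each child has a single parent.

```r
library(dplyr)

test_hierarchie <- tribble(
  ~child, ~parent,
  "A", "B",
  "B", "C",
  "D", "E"
)

# Hypothetical helper: climbs the hierarchy one level per iteration.
# max_depth is an assumed safety cap so that cyclic data cannot loop forever.
ancestors_long <- function(hierarchy, max_depth = 100L) {
  # Renamed copy of the table, used to look up the parent of any node.
  lookup <- rename(hierarchy, node = child, next_parent = parent)
  current <- select(hierarchy, child, parent)
  levels <- list()
  for (depth in seq_len(max_depth)) {
    if (nrow(current) == 0) break
    # Label this level: "parent", "parent_grant", "parent_grant_grant", ...
    levels[[depth]] <- mutate(
      current,
      relation = paste0("parent", strrep("_grant", depth - 1))
    )
    # Move one level up; rows whose ancestor has no parent drop out.
    current <- current %>%
      inner_join(lookup, by = c("parent" = "node")) %>%
      transmute(child, parent = next_parent)
  }
  bind_rows(levels) %>%
    select(child, relation, parent) %>%
    arrange(child, relation)
}

result <- ancestors_long(test_hierarchie)
result
```

Because every level is stacked as rows rather than added as columns, no pivot_longer() or NA filtering is needed afterwards.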
stefan answered Apr 01 '23 06:04


As was mentioned, you can use the igraph package, but it probably only pays off for more complex cases:

library(tidyverse)
library(igraph)

test_hierarchie <- tribble(~child, ~parent,
                           "A", "B",
                           "B", "C",
                           "D", "E"
)

g <- graph_from_data_frame(test_hierarchie)
finals <- V(g)[degree(g, mode = "out") == 0]
starts <- V(g)[!V(g) %in% finals]
#starts <- V(g)[degree(g, mode = "in") == 0] # use this to avoid sub-paths
imap_dfr(starts, 
         ~enframe(all_simple_paths(g, from = starts[[.y]], to = finals)[[1]],
                  name = "parent") %>%
           mutate(child = .y)) %>%
  filter(child != parent) %>%
  select(-value) %>%
  group_by(child) %>%
  mutate(nr = row_number() - 1) %>%
  ungroup() %>%
  mutate(relation = map_chr(nr, ~str_c("parent", str_c(rep("_grant", .x), collapse = "")))) %>%
  select(child, relation, parent)

# # A tibble: 4 x 3
# child relation     parent
# <chr> <chr>        <chr> 
# 1 A     parent       B     
# 2 A     parent_grant C     
# 3 B     parent       C     
# 4 D     parent       E   
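
If only the ancestor level is needed and not the paths themselves, a shorter igraph variant might be to read the levels off the shortest-path distance matrix. This is only a sketch, assuming each child has a single parent (with several parents, distances() reports the shortest route only); the names d, ancestors and level are chosen here for illustration.

```r
library(dplyr)
library(tidyr)
library(igraph)

test_hierarchie <- tribble(
  ~child, ~parent,
  "A", "B",
  "B", "C",
  "D", "E"
)

# Edges point from child to parent.
g <- graph_from_data_frame(test_hierarchie)

# With mode = "out", entry [i, j] is the number of edges on the shortest
# path from vertex i up to vertex j (Inf if j is not an ancestor of i).
d <- distances(g, mode = "out")

ancestors <- as_tibble(d, rownames = "child") %>%
  pivot_longer(-child, names_to = "parent", values_to = "level") %>%
  # Keep real ancestors only: drop unreachable pairs and the diagonal.
  filter(is.finite(level), level > 0) %>%
  mutate(relation = paste0("parent", strrep("_grant", level - 1))) %>%
  arrange(child, level) %>%
  select(child, relation, parent)

ancestors
```

For the three-row example this gives the same four child/relation/parent rows as above.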
r.user.05apr answered Apr 01 '23 06:04