Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr: want to add only 1 y column during a join, not all y columns

Tags:

r

dplyr

I need to join a table with itself in order to realize inheritance of a value in one column, as follows: There are two types of rows, base and dep (for "dependent"). All rows have a key, but dep rows also have a basekey referring to a base row. There is a column val and any number of other columns.

My goal: Obtain all dep rows, with their val replaced by the val of the corresponding base row.

Here is an example:

library(tidyverse)

tb = tribble(
    ~key, ~type, ~basekey, ~val, ~col5, ~colN,
    "a", "base", NA,       17,   1,      100,
    "b", "base", NA,       21,   4,      130,
    "d", "dep",  "a",      NA,   11,     320,
    "g", "dep",  "a",      NA,   14,     420,
    "n", "dep",  "a",      NA,   13,     230,
    "y", "dep",  "b",      NA,   66,     665
)

My actual data has a few hundred rows and N=20 columns, so runtime or memory performance is not an issue, but operations requiring a complete column list should be avoided.

sqldf

Previously, I used the sqldf package, which can express my requirement nicely:

sqldf("select tb.*, tb2.val as baseval 
       from tb, tb as tb2 
       where tb2.key = tb.basekey")

The result is almost exactly what I want; I only need to map baseval to val or live with the longer name (which is OK for me):

  key type basekey val col5 colN baseval
1   d  dep       a  NA   11  320      17
2   g  dep       a  NA   14  420      17
3   n  dep       a  NA   13  230      17
4   y  dep       b  NA   66  665      21

dplyr

But then I started learning ggplot2 and encountered the tidyverse. After some reading, I decided to throw myself into its arms and rework my code according to its readable and tidy style.

readr and tibble worked like a charm, but with dplyr I am not yet entirely happy; my use case appears to be one that is not supported well (maybe it's just that my thinking is not taking the best route yet).

I need to do a left join, but although all I want is the val column, there appears to be no method (as of dplyr 0.7.4) for suppressing the others:

tbj = left_join(filter(tb, type=='dep'), tb, by=c(basekey='key'))

which gives me this:

# A tibble: 4 x 11
    key type.x basekey val.x col5.x colN.x type.y basekey.y val.y col5.y colN.y
  <chr>  <chr>   <chr> <dbl>  <dbl>  <dbl>  <chr>     <chr> <dbl>  <dbl>  <dbl>
1     d    dep       a    NA     11    320   base      <NA>    17      1    100
2     g    dep       a    NA     14    420   base      <NA>    17      1    100
3     n    dep       a    NA     13    230   base      <NA>    17      1    100
4     y    dep       b    NA     66    665   base      <NA>    21      4    130

Whew! To get what I want, I now need to

  • Map val.y to val.
  • Kick out all the other *.y columns
  • Kick out val.x
  • Rename the other *.x columns to remove the suffix

This sounds laborious. How to do it best? Is this even the right approach or is there a more dplyr-ish way to achieve my goal?

like image 533
Lutz Prechelt Avatar asked Oct 15 '25 15:10

Lutz Prechelt


2 Answers

If I understand your question, how about the following approach with dplyr: Remove val from the dep subset, since it's empty anyway. Then, keep only key and val in the base subset, rename key to basekey and join. Then post-join renaming isn't needed and you don't get any extra columns.

tb %>% 
  filter(type=="dep") %>% select(-val) %>% 
  left_join(tb %>% filter(type=="base") %>% select(basekey=key, val))
    key  type basekey  col5  colN   val
  <chr> <chr>   <chr> <dbl> <dbl> <dbl>
1     d   dep       a    11   320    17
2     g   dep       a    14   420    17
3     n   dep       a    13   230    17
4     y   dep       b    66   665    21
like image 52
eipi10 Avatar answered Oct 18 '25 06:10

eipi10


First, you join the two tables and then select the columns needed. You can take advantage of the argument suffix by leaving the column names of the first table unchanged. Then selecting all columns from the first table and adding the needed column from the second table: val.y.

  • All columns from the first table plus val.y from the second.
  • tb %>% filter(type == "dep") %>% 
      left_join(tb %>% filter(type == "base"), 
                by = c("basekey" = "key"), 
                suffix = c("", ".y")) %>% 
      select(names(tb), "val.y")
    

    Output:

    # A tibble: 4 x 7
        key  type basekey   val  col5  colN val.y
      <chr> <chr>   <chr> <dbl> <dbl> <dbl> <dbl>
    1     d   dep       a    NA    11   320    17
    2     g   dep       a    NA    14   420    17
    3     n   dep       a    NA    13   230    17
    4     y   dep       b    NA    66   665    21
    

  • All columns but column val from the first table , plus val.y from the second.
  • :

    tb %>% filter(type == "dep") %>% 
      left_join(tb %>% filter(type == "base"), 
                by = c("basekey" = "key"), 
                suffix = c("", ".y")) %>% 
      select(names(tb), -val, "val.y")
    

    Output:

    # A tibble: 4 x 6
        key  type basekey  col5  colN val.y
      <chr> <chr>   <chr> <dbl> <dbl> <dbl>
    1     d   dep       a    11   320    17
    2     g   dep       a    14   420    17
    3     n   dep       a    13   230    17
    4     y   dep       b    66   665    21
    
    like image 30
    mpalanco Avatar answered Oct 18 '25 06:10

    mpalanco