Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merging data using left_join based on a condition

Tags:

r

I have two data frames I am trying to merge.

"data" looks like:

    Filled_Ticker2LP publishYear              CO_1_Name
 1:             SONC        2005             sonic corp
 2:             SONC        2005             sonic corp
 3:             <NA>        2005   cascade bancorp inc.
 4:              JCP        2005  jc penney company inc

"comp" looks like:

   tic   fyear conm           
   <chr> <int> <chr>          
 1 JCP    2004 penney (j c) co
 2 JCP    2005 penney (j c) co
 3 JCP    2006 penney (j c) co
 4 JCP    2007 penney (j c) co
 5 JCP    2008 penney (j c) co

I want to join these two data sets together using left_join (or something from the data.table package etc.)

I can currently join it based on the year and symbol such as SONC, JCP.

mergedData <- data %>% 
  left_join(comp, by = c("Filled_Ticker2LP" = "tic", "publishYear" = "fyear"))

The "mergedData" looks like:

    Filled_Ticker2LP publishYear              CO_1_Name            conm
 1:             SONC        2005             sonic corp      sonic corp
 2:             SONC        2005             sonic corp      sonic corp
 3:             <NA>        2005   cascade bancorp inc.            <NA>
 4:              JCP        2005  jc penney company inc penney (j c) co

Which Works well but it has an NA in the Filled_Ticker2LP column (which comes from the data data set.

I would like to try and join the data using the current method that I have but if there is an NA value in the column Filled_Ticker2LP I would like to change the "matching link" from joining on tic and Filled_Ticker2LP to matching on company name. conm or CO_1_Name.

That is, the data fails to join currently because in the data data set under the column Filled_Ticker2LP for observation 3 it has an NA value. However, this observation can still be joined with the comp data since in the column data$CO_1_Name observation 3 has cascade Bancorp. This result also appears in the comp data from observation 30 - 53 under the column conm.

I was thinking of an if statement that:

if not NA in data$Filled_Ticker2LP then join using Filled_Ticker2LP and tic else join using CO_1_Name and conm.

Additional

I also noticed that there is some whitespace in the columns.

data %>% 
  mutate(
    CO_1_Name = str_trim(CO_1_Name)
  )

Data 1:

comp <- structure(list(tic = c("JCP", "JCP", "JCP", "JCP", "JCP", "JCP", 
"JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP", 
"SONC", "SONC", "SONC", "SONC", "SONC", "SONC", "SONC", "SONC", 
"SONC", "SONC", "SONC", "SONC", "SONC", "SONC", "CACB", "CACB", 
"CACB", "CACB", "CACB", "CACB", "CACB", "CACB", "CACB", "CACB", 
"CACB", "CACB", "CACB", "CACB", "CACB", "CACB", "CACB", "CACB", 
"CACB", "CACB", "CACB", "CACB", "CACB", "CACB"), fyear = c(2004L, 
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 
2014L, 2015L, 2016L, 2017L, 2018L, 2005L, 2006L, 2007L, 2008L, 
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 
2018L, 2005L, 2005L, 2006L, 2006L, 2007L, 2007L, 2008L, 2008L, 
2009L, 2009L, 2010L, 2010L, 2011L, 2011L, 2012L, 2012L, 2013L, 
2013L, 2014L, 2014L, 2015L, 2015L, 2016L, 2016L), conm = c("penney (j c) co", 
"penney (j c) co", "penney (j c) co", "penney (j c) co", "penney (j c) co", 
"penney (j c) co", "penney (j c) co", "penney (j c) co", "penney (j c) co", 
"penney (j c) co", "penney (j c) co", "penney (j c) co", "penney (j c) co", 
"penney (j c) co", "penney (j c) co", "sonic corp", "sonic corp", 
"sonic corp", "sonic corp", "sonic corp", "sonic corp", "sonic corp", 
"sonic corp", "sonic corp", "sonic corp", "sonic corp", "sonic corp", 
"sonic corp", "sonic corp", "cascade bancorp", "cascade bancorp", 
"cascade bancorp", "cascade bancorp", "cascade bancorp", "cascade bancorp", 
"cascade bancorp", "cascade bancorp", "cascade bancorp", "cascade bancorp", 
"cascade bancorp", "cascade bancorp", "cascade bancorp", "cascade bancorp", 
"cascade bancorp", "cascade bancorp", "cascade bancorp", "cascade bancorp", 
"cascade bancorp", "cascade bancorp", "cascade bancorp", "cascade bancorp", 
"cascade bancorp", "cascade bancorp")), row.names = c(NA, -53L
), class = c("tbl_df", "tbl", "data.frame"))

Data 2:

data <- structure(list(Filled_Ticker2LP = c("SONC", "SONC", NA, "JCP", 
"JCP", "JCP", "SONC", "SONC", "JCP", "JCP", "JCP", "JCP", "SONC", 
"JCP", "JCP", "JCP", "SONC", "JCP", "JCP", "SONC", "JCP", "JCP", 
"JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP", 
"JCP", "JCP", "JCP", "JCP", "SONC"), publishYear = c(2005L, 2005L, 
2005L, 2005L, 2005L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L, 
2006L, 2007L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L, 
2009L, 2009L, 2010L, 2010L, 2010L, 2010L, 2011L, 2011L, 2011L, 
2011L, 2011L, 2012L, 2013L, 2015L, 2015L, 2016L), CO_1_Name = c(" sonic corp", 
" sonic corp", " cascade bancorp inc.", " jc penney company inc", 
" jc penney company inc", " jc penney company inc", " sonic corp", 
" sonic corp", " jc penney company inc", " jc penney company inc", 
" jc penney company inc", " jc penney company inc", " sonic corp", 
" jc penney company inc", " jc penney company inc", " jc penney company inc", 
" sonic corp", " jc penney company inc", " jc penney company inc", 
" sonic corp", " jc penney company inc", " jc penney company inc", 
" jc penney company inc", " macy's incorporated", " macy's incorporated", 
" jc penney company inc", " macy's incorporated", " macy's incorporated", 
" jc penney company inc", " apple inc", " apple inc", " macy's incorporated", 
" jc penney company inc", " jc penney company inc", " jc penney company inc", 
" sonic corp")), .internal.selfref = <pointer: 0x55603dbefe00>, row.names = c(NA, 
-36L), class = c("data.table", "data.frame"))
like image 749
user8959427 Avatar asked Dec 01 '25 03:12

user8959427


1 Answers

If we need to join by a condition, one option is to split based on the occurrence of NA in the 'data' for that specific column, then use map2 to do the join

library(dplyr)
library(purrr)
library(stringr)
data %>% 
   mutate(
     CO_1_Name = str_trim(CO_1_Name)
  ) %>% 
   group_split(grp = is.na(Filled_Ticker2LP), keep = FALSE) %>% 
   map2_dfr(list(setNames('tic', "Filled_Ticker2LP"), 
             setNames( "conm", "CO_1_Name")), ~
    left_join(.x, comp, by = c(.y, "publishYear" = "fyear")))
like image 193
akrun Avatar answered Dec 02 '25 18:12

akrun