I have two data frames I am trying to merge.
"data" looks like:
Filled_Ticker2LP publishYear CO_1_Name
1: SONC 2005 sonic corp
2: SONC 2005 sonic corp
3: <NA> 2005 cascade bancorp inc.
4: JCP 2005 jc penney company inc
"comp" looks like:
tic fyear conm
<chr> <int> <chr>
1 JCP 2004 penney (j c) co
2 JCP 2005 penney (j c) co
3 JCP 2006 penney (j c) co
4 JCP 2007 penney (j c) co
5 JCP 2008 penney (j c) co
I want to join these two data sets together using left_join (or something from the data.table package etc.)
I can currently join it based on the year and symbol such as SONC, JCP.
mergedData <- data %>%
left_join(comp, by = c("Filled_Ticker2LP" = "tic", "publishYear" = "fyear"))
The "mergedData" looks like:
Filled_Ticker2LP publishYear CO_1_Name conm
1: SONC 2005 sonic corp sonic corp
2: SONC 2005 sonic corp sonic corp
3: <NA> 2005 cascade bancorp inc. <NA>
4: JCP 2005 jc penney company inc penney (j c) co
Which Works well but it has an NA in the Filled_Ticker2LP column (which comes from the data data set.
I would like to try and join the data using the current method that I have but if there is an NA value in the column Filled_Ticker2LP I would like to change the "matching link" from joining on tic and Filled_Ticker2LP to matching on company name. conm or CO_1_Name.
That is, the data fails to join currently because in the data data set under the column Filled_Ticker2LP for observation 3 it has an NA value. However, this observation can still be joined with the comp data since in the column data$CO_1_Name observation 3 has cascade Bancorp. This result also appears in the comp data from observation 30 - 53 under the column conm.
I was thinking of an if statement that:
if not NA in data$Filled_Ticker2LP then join using Filled_Ticker2LP and tic else join using CO_1_Name and conm.
Additional
I also noticed that there is some whitespace in the columns.
data %>%
mutate(
CO_1_Name = str_trim(CO_1_Name)
)
Data 1:
comp <- structure(list(tic = c("JCP", "JCP", "JCP", "JCP", "JCP", "JCP",
"JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP",
"SONC", "SONC", "SONC", "SONC", "SONC", "SONC", "SONC", "SONC",
"SONC", "SONC", "SONC", "SONC", "SONC", "SONC", "CACB", "CACB",
"CACB", "CACB", "CACB", "CACB", "CACB", "CACB", "CACB", "CACB",
"CACB", "CACB", "CACB", "CACB", "CACB", "CACB", "CACB", "CACB",
"CACB", "CACB", "CACB", "CACB", "CACB", "CACB"), fyear = c(2004L,
2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L,
2014L, 2015L, 2016L, 2017L, 2018L, 2005L, 2006L, 2007L, 2008L,
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L,
2018L, 2005L, 2005L, 2006L, 2006L, 2007L, 2007L, 2008L, 2008L,
2009L, 2009L, 2010L, 2010L, 2011L, 2011L, 2012L, 2012L, 2013L,
2013L, 2014L, 2014L, 2015L, 2015L, 2016L, 2016L), conm = c("penney (j c) co",
"penney (j c) co", "penney (j c) co", "penney (j c) co", "penney (j c) co",
"penney (j c) co", "penney (j c) co", "penney (j c) co", "penney (j c) co",
"penney (j c) co", "penney (j c) co", "penney (j c) co", "penney (j c) co",
"penney (j c) co", "penney (j c) co", "sonic corp", "sonic corp",
"sonic corp", "sonic corp", "sonic corp", "sonic corp", "sonic corp",
"sonic corp", "sonic corp", "sonic corp", "sonic corp", "sonic corp",
"sonic corp", "sonic corp", "cascade bancorp", "cascade bancorp",
"cascade bancorp", "cascade bancorp", "cascade bancorp", "cascade bancorp",
"cascade bancorp", "cascade bancorp", "cascade bancorp", "cascade bancorp",
"cascade bancorp", "cascade bancorp", "cascade bancorp", "cascade bancorp",
"cascade bancorp", "cascade bancorp", "cascade bancorp", "cascade bancorp",
"cascade bancorp", "cascade bancorp", "cascade bancorp", "cascade bancorp",
"cascade bancorp", "cascade bancorp")), row.names = c(NA, -53L
), class = c("tbl_df", "tbl", "data.frame"))
Data 2:
data <- structure(list(Filled_Ticker2LP = c("SONC", "SONC", NA, "JCP",
"JCP", "JCP", "SONC", "SONC", "JCP", "JCP", "JCP", "JCP", "SONC",
"JCP", "JCP", "JCP", "SONC", "JCP", "JCP", "SONC", "JCP", "JCP",
"JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP", "JCP",
"JCP", "JCP", "JCP", "JCP", "SONC"), publishYear = c(2005L, 2005L,
2005L, 2005L, 2005L, 2006L, 2006L, 2006L, 2006L, 2006L, 2006L,
2006L, 2007L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L,
2009L, 2009L, 2010L, 2010L, 2010L, 2010L, 2011L, 2011L, 2011L,
2011L, 2011L, 2012L, 2013L, 2015L, 2015L, 2016L), CO_1_Name = c(" sonic corp",
" sonic corp", " cascade bancorp inc.", " jc penney company inc",
" jc penney company inc", " jc penney company inc", " sonic corp",
" sonic corp", " jc penney company inc", " jc penney company inc",
" jc penney company inc", " jc penney company inc", " sonic corp",
" jc penney company inc", " jc penney company inc", " jc penney company inc",
" sonic corp", " jc penney company inc", " jc penney company inc",
" sonic corp", " jc penney company inc", " jc penney company inc",
" jc penney company inc", " macy's incorporated", " macy's incorporated",
" jc penney company inc", " macy's incorporated", " macy's incorporated",
" jc penney company inc", " apple inc", " apple inc", " macy's incorporated",
" jc penney company inc", " jc penney company inc", " jc penney company inc",
" sonic corp")), .internal.selfref = <pointer: 0x55603dbefe00>, row.names = c(NA,
-36L), class = c("data.table", "data.frame"))
If we need to join by a condition, one option is to split based on the occurrence of NA in the 'data' for that specific column, then use map2 to do the join
library(dplyr)
library(purrr)
library(stringr)
data %>%
mutate(
CO_1_Name = str_trim(CO_1_Name)
) %>%
group_split(grp = is.na(Filled_Ticker2LP), keep = FALSE) %>%
map2_dfr(list(setNames('tic', "Filled_Ticker2LP"),
setNames( "conm", "CO_1_Name")), ~
left_join(.x, comp, by = c(.y, "publishYear" = "fyear")))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With