I would like to join two data frames. Some of the column names overlap, and there are NA
entries in one of the data frame's overlapping columns. Here is a simplified example:
df1 <- data.frame(fruit = c('apples','oranges','bananas','grapes'), var1 = c(1,2,3,4), var2 = c(3,NA,6,NA), stringsAsFactors = FALSE)
df2 <- data.frame(fruit = c('oranges','grapes'), var2=c(5,6), var3=c(7,8), stringsAsFactors = FALSE)
Can I use dplyr join functions to join these data frames and automatically prioritize the non-NA entry so that I get the "var2" column to have no NA entries in the joined data frame? As it is now, if I call left_join, it keeps the NA entries, and if I call full_join, it duplicates the rows.
Example Data
> df1
    fruit var1 var2
1  apples    1    3
2 oranges    2   NA
3 bananas    3    6
4  grapes    4   NA
> df2
    fruit var2 var3
1 oranges    5    7
2  grapes    6    8
To replace NA with 0 in an R data frame, use is.na() to select every NA entry and assign 0 to the selection. Here myDataframe stands for the data frame in which you would like to replace all NAs with 0.
You can replace NA values with zero in the numeric columns of an R data frame using is.na(), replace(), imputeTS::na_replace(), dplyr::coalesce(), dplyr::mutate_at(), dplyr::mutate_if(), or tidyr::replace_na().
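As a minimal sketch of the is.na() approach in base R (the contents of myDataframe are made up here purely for illustration):

```r
# Hypothetical data frame, invented for this example
myDataframe <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# is.na() returns a logical matrix marking every NA cell;
# assigning through that mask overwrites only those cells
myDataframe[is.na(myDataframe)] <- 0

myDataframe
```

This is fine when 0 is a sensible default for every column. It is not what the question asks for, though, since there the missing values should come from the other data frame.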
All of the dplyr functions take a data frame (or tibble) as the first argument. Rather than forcing the user to either save intermediate objects or nest functions, dplyr provides the %>% operator from magrittr.
In base R, the merge() function merges two data frames; the dplyr package provides the corresponding join functions, such as left_join() and full_join(). The most important condition for joining two data frames is that the columns on which the merge happens have the same type in both. merge() works much like a join in a DBMS.
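For comparison, here is a rough base R sketch of the same task with merge(), using the data frames from the question. The name merged is only illustrative. Note that merge() sorts the result by the key column by default, so the row order differs from df1.

```r
df1 <- data.frame(fruit = c('apples', 'oranges', 'bananas', 'grapes'),
                  var1 = c(1, 2, 3, 4), var2 = c(3, NA, 6, NA),
                  stringsAsFactors = FALSE)
df2 <- data.frame(fruit = c('oranges', 'grapes'),
                  var2 = c(5, 6), var3 = c(7, 8),
                  stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of df1 (a left join);
# the overlapping column comes back as var2.x and var2.y
merged <- merge(df1, df2, by = "fruit", all.x = TRUE)

# Keep df1's value unless it is NA, then fall back to df2's value
merged$var2 <- ifelse(is.na(merged$var2.x), merged$var2.y, merged$var2.x)
merged$var2.x <- NULL
merged$var2.y <- NULL

merged
```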
coalesce might be something you need. It fills the NA from the first vector with values from the second vector at corresponding positions:
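library(dplyr)
df1 %>%
  # joining on "fruit" only turns the shared column into var2.x (df1) and var2.y (df2)
  left_join(df2, by = "fruit") %>%
  # take var2.x where it is not NA, otherwise var2.y
  mutate(var2 = coalesce(var2.x, var2.y)) %>%
  select(-var2.x, -var2.y)
#     fruit var1 var3 var2
# 1  apples    1   NA    3
# 2 oranges    2    7    5
# 3 bananas    3   NA    6
# 4  grapes    4    8    6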
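To see what coalesce does in isolation, here is a small sketch on two bare vectors. The names x and y are made up; they hold the values that var2.x and var2.y would have after the left join.

```r
library(dplyr)

# Hypothetical stand-ins for var2.x and var2.y after the left join
x <- c(3, NA, 6, NA)
y <- c(NA, 5, NA, 6)

# For each position, take the first non-NA value among the arguments
filled <- coalesce(x, y)
filled
# [1] 3 5 6 6
```

If both vectors are NA at some position, the result is still NA there, so this only removes all NAs when df2 covers every gap in df1.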
Or use data.table, which does in-place replacing:
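library(data.table)
# setDT() converts both data frames to data.tables by reference;
# the join then updates df1 in place, with i.* referring to columns of df2.
# Note: var2 is overwritten for every fruit found in df2, NA or not.
setDT(df1)[setDT(df2), on = "fruit", `:=` (var2 = i.var2, var3 = i.var3)]
df1
#      fruit var1 var2 var3
# 1:  apples    1    3   NA
# 2: oranges    2    5    7
# 3: bananas    3    6   NA
# 4:  grapes    4    6    8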
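One caveat: the update join above assigns i.var2 to every matched row, which gives the right answer here only because the matched rows in df1 happen to be NA. If existing non-NA values should win, a possible variant is to wrap the assignment in data.table's fcoalesce(). This is a sketch, with dt1 and dt2 as illustrative names for fresh copies of the question's data:

```r
library(data.table)

dt1 <- data.table(fruit = c('apples', 'oranges', 'bananas', 'grapes'),
                  var1 = c(1, 2, 3, 4), var2 = c(3, NA, 6, NA))
dt2 <- data.table(fruit = c('oranges', 'grapes'),
                  var2 = c(5, 6), var3 = c(7, 8))

# Inside the join, var2 is dt1's column and i.var2 is dt2's column;
# fcoalesce() keeps dt1's value unless it is NA
dt1[dt2, on = "fruit",
    `:=`(var2 = fcoalesce(var2, i.var2), var3 = i.var3)]

dt1
```

fcoalesce() expects its arguments to have the same type, which holds here because both var2 columns are numeric.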
Using purrr along with dplyr might be a solution that extends to multiple columns:
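library(purrr)
library(dplyr)
df <- left_join(df1, df2, by = "fruit")
# df[3] is var2.x and df[4] is var2.y; keep var2.x unless it is NA.
# The filled column keeps the name var2.x in the result.
map2_dfr(df[3], df[4], ~ ifelse(is.na(.x), .y, .x)) %>%
  bind_cols(df[c(1, 2, 5)], .)
    fruit var1 var3 var2.x
1  apples    1   NA      3
2 oranges    2    7      5
3 bananas    3   NA      6
4  grapes    4    8      6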
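If there are several overlapping columns, selecting them by position gets fragile. One possible way to generalize is a small helper that finds the shared non-key columns by name and coalesces each pair. coalesce_join is a made-up name, not a dplyr function, and the sketch assumes by is a plain character vector of key columns present in both data frames and that no existing column already ends in ".x" or ".y".

```r
library(dplyr)

# Hypothetical helper: left join, then coalesce every overlapping non-key column
coalesce_join <- function(x, y, by) {
  joined <- left_join(x, y, by = by, suffix = c(".x", ".y"))
  shared <- setdiff(intersect(names(x), names(y)), by)
  for (col in shared) {
    col_x <- paste0(col, ".x")
    col_y <- paste0(col, ".y")
    # Prefer the value from x, fall back to y where x is NA
    joined[[col]] <- coalesce(joined[[col_x]], joined[[col_y]])
    joined[[col_x]] <- NULL
    joined[[col_y]] <- NULL
  }
  joined
}

df1 <- data.frame(fruit = c('apples', 'oranges', 'bananas', 'grapes'),
                  var1 = c(1, 2, 3, 4), var2 = c(3, NA, 6, NA),
                  stringsAsFactors = FALSE)
df2 <- data.frame(fruit = c('oranges', 'grapes'),
                  var2 = c(5, 6), var3 = c(7, 8),
                  stringsAsFactors = FALSE)

res <- coalesce_join(df1, df2, by = "fruit")
res
```

The coalesced columns end up at the right edge of the result; add a select() or relocate() call afterwards if the original column order matters.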