dplyr join define NA values

Can I define a "fill" value for NA in a dplyr join? For example, can I specify in the join that all NA values should be 1?

require(dplyr)
lookup <- data.frame(cbind(c("USD","MYR"),c(0.9,1.1)))
names(lookup) <- c("rate","value")
fx <- data.frame(c("USD","MYR","USD","MYR","XXX","YYY"))
names(fx)[1] <- "rate"
left_join(x=fx,y=lookup,by=c("rate"))

The code above creates NA for the values "XXX" and "YYY". In my case I am joining a large number of columns and there will be a lot of non-matches. All non-matches should get the same value. I know I can do it in several steps, but the question is: can it all be done in one? Thanks!

Triamus asked Mar 11 '15 16:03




1 Answer

First off, I would recommend not using the combination data.frame(cbind(...)). Here's why: cbind creates a matrix by default if you only pass atomic vectors to it, and a matrix in R can only hold one type of data (think of a matrix as a vector with a dimension attribute, i.e. number of rows and columns). Therefore, your code

cbind(c("USD","MYR"),c(0.9,1.1)) 

creates a character matrix:

str(cbind(c("USD","MYR"),c(0.9,1.1)))
# chr [1:2, 1:2] "USD" "MYR" "0.9" "1.1"

although you probably expected a final data frame with a character or factor column (rate) and a numeric column (value). But what you get is:

str(data.frame(cbind(c("USD","MYR"),c(0.9,1.1))))
#'data.frame':  2 obs. of  2 variables:
# $ X1: Factor w/ 2 levels "MYR","USD": 2 1
# $ X2: Factor w/ 2 levels "0.9","1.1": 1 2

because, in R versions before 4.0.0, data.frame converts strings (characters) to factors by default. You can avoid this by specifying stringsAsFactors = FALSE in the data.frame() call; since R 4.0.0 that is the default.
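To make the coercion concrete, here is a small base-R sketch. The names m, df_chr and df_ok are only illustrative and do not appear in the question. It checks that the cbind result is a character matrix, that stringsAsFactors = FALSE on its own still leaves the value column as character, and that passing the vectors directly to data.frame() keeps value numeric.

```r
# cbind() on atomic vectors returns a matrix, which holds a single type
m <- cbind(c("USD", "MYR"), c(0.9, 1.1))
is.character(m)   # the numbers were coerced to strings
dim(m)            # a matrix is a vector with a dim attribute

# stringsAsFactors = FALSE avoids factors, but both columns stay character
df_chr <- data.frame(m, stringsAsFactors = FALSE)
sapply(df_chr, class)

# Passing the vectors directly to data.frame() keeps value numeric
df_ok <- data.frame(rate = c("USD", "MYR"), value = c(0.9, 1.1),
                    stringsAsFactors = FALSE)
sapply(df_ok, class)
```

So stringsAsFactors = FALSE only addresses the factor conversion; the numeric type is already lost inside cbind.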

I suggest the following alternative approach to create the sample data (also note that you can easily specify the column names in the same call):

lookup <- data.frame(rate = c("USD","MYR"),
                     value = c(0.9,1.1))

fx <- data.frame(rate = c("USD","MYR","USD","MYR","XXX","YYY"))

Now, for your actual question: if I understand correctly, you want to replace all NAs with a 1 in the joined data. If that's correct, here's a custom function using left_join and mutate_each to do that:

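library(dplyr)

# Left join, then replace every NA in the result with 1.
# Further arguments such as by are forwarded to left_join() via ...
left_join_NA <- function(x, y, ...) {
  left_join(x = x, y = y, ...) %>%
    mutate_each(funs(replace(., which(is.na(.)), 1)))
}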

Now you can apply it to your data like this:

> left_join_NA(x = fx, y = lookup, by = "rate")
#  rate value
#1  USD   0.9
#2  MYR   1.1
#3  USD   0.9
#4  MYR   1.1
#5  XXX   1.0
#6  YYY   1.0
#Warning message:
#joining factors with different levels, coercing to character vector

Note that you end up with a character column (rate) and a numeric column (value) and all NAs are replaced by 1.

str(left_join_NA(x = fx, y = lookup, by = "rate"))
#'data.frame':  6 obs. of  2 variables:
# $ rate : chr  "USD" "MYR" "USD" "MYR" ...
# $ value: num  0.9 1.1 0.9 1.1 1 1
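mutate_each() and funs() have been deprecated in later dplyr releases, so the same idea may need adapting there. Below is a minimal sketch, assuming dplyr 1.0.0 or newer (for across() and where()). The helper name left_join_fill and its fill argument are hypothetical, not part of dplyr. It differs from the function above by restricting the replacement to numeric columns.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): left join, then replace
# NAs in all numeric columns of the result with `fill`
left_join_fill <- function(x, y, by, fill = 1) {
  left_join(x, y, by = by) %>%
    mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), fill)))
}

# Sample data from the answer, with strings kept as character
lookup <- data.frame(rate = c("USD", "MYR"), value = c(0.9, 1.1),
                     stringsAsFactors = FALSE)
fx <- data.frame(rate = c("USD", "MYR", "USD", "MYR", "XXX", "YYY"),
                 stringsAsFactors = FALSE)

result <- left_join_fill(fx, lookup, by = "rate")
result
```

Like the original function, this also fills NAs that were already present in numeric columns of x, not only those introduced by non-matches, so it is worth checking whether that is acceptable for your data.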
talat answered Sep 29 '22 12:09
