In a large dataframe ("myfile") with four columns I have to add a fifth column with values conditionally based on the first four columns.
I would prefer answers using dplyr and mutate, mainly because of their speed on large datasets.
My dataframe looks like this:
  V1 V2 V3 V4
1  1  2  3  5
2  2  4  4  1
3  1  4  1  1
4  4  5  1  3
5  5  5  5  4
...
The values of the fifth column (V5) are based on some conditional rules:
if (V1==1 & V2!=4) {
  V5 <- 1
} else if (V2==4 & V3!=1) {
  V5 <- 2
} else {
  V5 <- 0
}
Now I want to use the mutate function to apply these rules to all rows (to avoid slow loops). Something like this (and yes, I know it doesn't work this way!):
myfile <- mutate(myfile, if (V1==1 & V2!=4){V5 = 1} else if (V2==4 & V3!=1){V5 = 2} else {V5 = 0})
This should be the result:
  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0
How can I do this in dplyr?
Use mutate() and its scoped variants mutate_all(), mutate_if() and mutate_at() from the dplyr package to replace or update the values of a column (string, integer, or any other type) in an R data frame (data.frame).
mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. New variables overwrite existing variables of the same name.
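The difference between the two verbs can be seen in a minimal sketch. The data frame `df` and its columns `a`, `b` and `total` are hypothetical names made up for illustration; they are not part of the question's data.

```r
library(dplyr)

# Hypothetical toy data frame, not the asker's data
df <- data.frame(a = 1:3, b = c(10, 20, 30))

# mutate() keeps a and b and appends the new column
kept <- df %>% mutate(total = a + b)
names(kept)      # "a" "b" "total"

# transmute() keeps only the columns it creates
dropped <- df %>% transmute(total = a + b)
names(dropped)   # "total"

# Reusing an existing name overwrites that column
overwritten <- df %>% mutate(b = b * 2)
overwritten$b    # 20 40 60
```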
Often you may want to create a new variable in a data frame in R based on some condition. Fortunately this is easy to do using the mutate() and case_when() functions from the dplyr package.
Try this:
myfile %>% mutate(V5 = (V1 == 1 & V2 != 4) + 2 * (V2 == 4 & V3 != 1))
giving:
  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0
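The arithmetic version relies on R treating TRUE as 1 and FALSE as 0 when logical vectors are used in arithmetic. A small base R sketch of the intermediate steps follows; the names `rule1` and `rule2` are introduced here only for illustration.

```r
# Same values as the question's example columns
V1 <- c(1, 2, 1, 4, 5)
V2 <- c(2, 4, 4, 5, 5)
V3 <- c(3, 4, 1, 1, 5)

rule1 <- V1 == 1 & V2 != 4   # TRUE FALSE FALSE FALSE FALSE
rule2 <- V2 == 4 & V3 != 1   # FALSE TRUE FALSE FALSE FALSE

# TRUE counts as 1 and FALSE as 0 in arithmetic
V5 <- rule1 + 2 * rule2
V5                           # 1 2 0 0 0

# The shortcut is only safe when the rules can never hold together;
# here rule1 needs V2 != 4 and rule2 needs V2 == 4, so they cannot
any(rule1 & rule2)           # FALSE
```

If the two rules could overlap, a row matching both would get 3 instead of 1, so the nested ifelse form is the safer general translation of an if / else if chain.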
or this:
myfile %>% mutate(V5 = ifelse(V1 == 1 & V2 != 4, 1, ifelse(V2 == 4 & V3 != 1, 2, 0)))
giving:
  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0
I suggest you choose a better name for your data frame; myfile makes it seem as if it holds a file name.
Above used this input:
myfile <- structure(list(V1 = c(1L, 2L, 1L, 4L, 5L),
                         V2 = c(2L, 4L, 4L, 5L, 5L),
                         V3 = c(3L, 4L, 1L, 1L, 5L),
                         V4 = c(5L, 1L, 1L, 3L, 4L)),
                    .Names = c("V1", "V2", "V3", "V4"),
                    class = "data.frame",
                    row.names = c("1", "2", "3", "4", "5"))
Update 1: Since this was originally posted, dplyr has changed %.% to %>%, so I have modified the answer accordingly.
Update 2: dplyr now has case_when, which provides another solution:
myfile %>% mutate(V5 = case_when(V1 == 1 & V2 != 4 ~ 1, V2 == 4 & V3 != 1 ~ 2, TRUE ~ 0))
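As a quick sanity check, the three approaches can be run side by side on the sample input. This is only a sketch: the result names `arith`, `nested` and `cw` are made up here, and the data frame is rebuilt with data.frame() rather than structure() for brevity.

```r
library(dplyr)

# Sample input from the answer, rebuilt with data.frame()
myfile <- data.frame(V1 = c(1L, 2L, 1L, 4L, 5L),
                     V2 = c(2L, 4L, 4L, 5L, 5L),
                     V3 = c(3L, 4L, 1L, 1L, 5L),
                     V4 = c(5L, 1L, 1L, 3L, 4L))

# Arithmetic on logicals
arith <- myfile %>%
  mutate(V5 = (V1 == 1 & V2 != 4) + 2 * (V2 == 4 & V3 != 1))

# Nested ifelse
nested <- myfile %>%
  mutate(V5 = ifelse(V1 == 1 & V2 != 4, 1,
              ifelse(V2 == 4 & V3 != 1, 2, 0)))

# case_when
cw <- myfile %>%
  mutate(V5 = case_when(V1 == 1 & V2 != 4 ~ 1,
                        V2 == 4 & V3 != 1 ~ 2,
                        TRUE ~ 0))

# All three give the same column on this input
all.equal(arith$V5, nested$V5)   # TRUE
all.equal(nested$V5, cw$V5)      # TRUE
```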
With dplyr 0.7.2, you can use the very useful case_when function:
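library(dplyr)
# the header has one field fewer than the data rows, so the first column becomes the row names
x = read.table(text="V1 V2 V3 V4
1 1 2 3 5
2 2 4 4 1
3 1 4 1 1
4 4 5 1 3
5 5 5 5 4")
x$V5 = case_when(x$V1==1 & x$V2!=4 ~ 1, x$V2==4 & x$V3!=1 ~ 2, TRUE ~ 0)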
Expressed with dplyr::mutate, it gives:
x = x %>% mutate(
  V5 = case_when(
    V1==1 & V2!=4 ~ 1,
    V2==4 & V3!=1 ~ 2,
    TRUE ~ 0
  )
)
Please note that NA values are not treated specially, which can be misleading. The function returns NA only when no condition is matched. If you include a line with TRUE ~ ..., as I did in my example, the return value will never be NA.
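Therefore, you have to explicitly tell case_when to put NA where it belongs by adding a statement like is.na(x$V1) | is.na(x$V3) ~ NA_real_ before the other conditions, since the first matching condition wins. NA_real_ is used here because the other right-hand sides in the example (1, 2 and 0) are doubles; use NA_integer_ if they are integers. Hint: the dplyr::coalesce() function can be really useful here sometimes!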
Moreover, please note that a bare NA will usually not work (at least in dplyr 0.7.2); you have to use one of the typed NA values: NA_integer_, NA_character_ or NA_real_.
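The NA behaviour described above can be sketched as follows. The three-row data frame is hypothetical, chosen so that the second row has a missing V1, and the names `without_rule`, `with_rule` and `filled` are made up for illustration. NA_real_ is used because the other right-hand sides (1, 2, 0) are doubles.

```r
library(dplyr)

# Hypothetical data: V1 is missing in the second row
x <- data.frame(V1 = c(1, NA, 2), V2 = c(2, 2, 4), V3 = c(3, 3, 4))

# No explicit NA rule: the row with the missing V1 falls through to TRUE ~ 0
without_rule <- x %>%
  mutate(V5 = case_when(V1 == 1 & V2 != 4 ~ 1,
                        V2 == 4 & V3 != 1 ~ 2,
                        TRUE ~ 0))
without_rule$V5   # 1 0 2

# Explicit NA rule placed first, with a typed NA matching the doubles
with_rule <- x %>%
  mutate(V5 = case_when(is.na(V1) | is.na(V3) ~ NA_real_,
                        V1 == 1 & V2 != 4 ~ 1,
                        V2 == 4 & V3 != 1 ~ 2,
                        TRUE ~ 0))
with_rule$V5      # 1 NA 2

# coalesce() can then replace the remaining NA with a chosen default
filled <- coalesce(with_rule$V5, -1)
filled            # 1 -1 2
```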