dplyr mutate with conditional values

Tags: r, dplyr

In a large dataframe ("myfile") with four columns I have to add a fifth column with values conditionally based on the first four columns.

I would prefer answers that use dplyr and mutate, mainly because of their speed on large datasets.

My dataframe looks like this:

  V1 V2 V3 V4
1  1  2  3  5
2  2  4  4  1
3  1  4  1  1
4  4  5  1  3
5  5  5  5  4
...

The values of the fifth column (V5) are based on some conditional rules:

if (V1==1 & V2!=4) {
  V5 <- 1
} else if (V2==4 & V3!=1) {
  V5 <- 2
} else {
  V5 <- 0
}

Now I want to use the mutate function to apply these rules to all rows (to avoid slow loops). Something like this (and yes, I know it doesn't work this way!):

myfile <- mutate(myfile, if (V1==1 & V2!=4){V5 = 1}
    else if (V2==4 & V3!=1){V5 = 2}
    else {V5 = 0})

This should be the result:

  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0

How can I do this in dplyr?


asked Mar 11 '14 21:03

rdatasculptor

2 Answers

Try this:

myfile %>% mutate(V5 = (V1 == 1 & V2 != 4) + 2 * (V2 == 4 & V3 != 1))

giving:

  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0

or this:

myfile %>% mutate(V5 = ifelse(V1 == 1 & V2 != 4, 1, ifelse(V2 == 4 & V3 != 1, 2, 0)))

giving:

  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0
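The arithmetic version above relies on two things: logical values are coerced to 0 and 1 in arithmetic, and the two conditions can never both be true (the first requires V2 != 4, the second requires V2 == 4). A minimal base R sketch of that reasoning follows; the names cond1, cond2, v5_arith and v5_ifelse are introduced here for illustration only and are not part of the original answer.

```r
# Sample data copied from the question.
myfile <- data.frame(
  V1 = c(1L, 2L, 1L, 4L, 5L),
  V2 = c(2L, 4L, 4L, 5L, 5L),
  V3 = c(3L, 4L, 1L, 1L, 5L),
  V4 = c(5L, 1L, 1L, 3L, 4L)
)

# The two rule conditions as logical vectors (illustrative names).
cond1 <- with(myfile, V1 == 1 & V2 != 4)
cond2 <- with(myfile, V2 == 4 & V3 != 1)

# cond1 needs V2 != 4 and cond2 needs V2 == 4, so they never overlap.
stopifnot(!any(cond1 & cond2))

# TRUE/FALSE are coerced to 1/0, so the sum is 1, 2 or 0 per row.
v5_arith <- cond1 + 2 * cond2

# The nested ifelse applies the same rules in if / else-if order.
v5_ifelse <- ifelse(cond1, 1, ifelse(cond2, 2, 0))

print(v5_arith)
print(identical(v5_arith, v5_ifelse))
```

If the conditions could overlap, the sum would no longer match the if / else-if logic (a row matching both would get 3), so the ifelse or case_when form is probably the safer choice for rules that are not mutually exclusive.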

Note

I suggest choosing a better name for your data frame; myfile makes it seem as if it holds a file name.

The examples above used this input:

myfile <-
structure(list(V1 = c(1L, 2L, 1L, 4L, 5L), V2 = c(2L, 4L, 4L,
5L, 5L), V3 = c(3L, 4L, 1L, 1L, 5L), V4 = c(5L, 1L, 1L, 3L, 4L
)), .Names = c("V1", "V2", "V3", "V4"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))

Update 1: Since this answer was originally posted, dplyr has changed %.% to %>%, so I have modified the answer accordingly.

Update 2: dplyr now has case_when, which provides another solution:

myfile %>%
  mutate(V5 = case_when(V1 == 1 & V2 != 4 ~ 1,
                        V2 == 4 & V3 != 1 ~ 2,
                        TRUE ~ 0))

answered Oct 11 '22 15:10

G. Grothendieck

With dplyr 0.7.2, you can use the very useful case_when function:

x = read.table(text = "V1 V2 V3 V4
1  1  2  3  5
2  2  4  4  1
3  1  4  1  1
4  4  5  1  3
5  5  5  5  4")
x$V5 = case_when(x$V1==1 & x$V2!=4 ~ 1,
                 x$V2==4 & x$V3!=1 ~ 2,
                 TRUE ~ 0)

Expressed with dplyr::mutate, it gives:

x = x %>% mutate(
    V5 = case_when(
        V1==1 & V2!=4 ~ 1,
        V2==4 & V3!=1 ~ 2,
        TRUE ~ 0
    )
)

Please note that NA values are not treated specially, which can be misleading. A condition that evaluates to NA simply does not match, and case_when returns NA only when no condition matches at all. If you add a line with TRUE ~ ..., as I did in my example, the return value will never be NA.

Therefore, you have to explicitly tell case_when to put NA where it belongs by adding a statement like is.na(x$V1) | is.na(x$V3) ~ NA_integer_ before the other conditions, since the first matching condition wins. Hint: the dplyr::coalesce() function can sometimes be really useful here!

Moreover, please note that NA alone will usually not work; you have to use one of the typed NA values: NA_integer_, NA_character_ or NA_real_.
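To make the NA behaviour concrete, here is a small sketch using hypothetical data (the question's columns V1 to V3 with a few missing values added; the data and the names no_na_rule and with_na_rule are not from the original post). It assumes dplyr is installed, and it uses NA_real_ rather than NA_integer_ because the other return values (1, 2, 0) are doubles.

```r
library(dplyr)

# Hypothetical data: columns from the question, with some NAs added.
x <- data.frame(
  V1 = c(1, NA, 1, 4),
  V2 = c(2, 4, 4, 5),
  V3 = c(3, 4, NA, 1)
)

# No explicit NA rule: a condition that evaluates to NA does not match,
# so such rows fall through to the TRUE ~ 0 fallback.
no_na_rule <- x %>%
  mutate(V5 = case_when(
    V1 == 1 & V2 != 4 ~ 1,
    V2 == 4 & V3 != 1 ~ 2,
    TRUE ~ 0
  ))

# Explicit NA rule placed first, because the first matching condition wins.
with_na_rule <- x %>%
  mutate(V5 = case_when(
    is.na(V1) | is.na(V3) ~ NA_real_,
    V1 == 1 & V2 != 4 ~ 1,
    V2 == 4 & V3 != 1 ~ 2,
    TRUE ~ 0
  ))

print(no_na_rule)
print(with_na_rule)
```

In the third row (V3 missing), the first version quietly returns 0 even though the second rule cannot be decided, while the second version returns NA for every row where V1 or V3 is missing.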

answered Oct 11 '22 15:10

Dan Chaltiel
