dplyr: override all but the first occurrence of a value within a group

Tags: r, dplyr

I have a grouped data_frame with a "tag" column that takes the values 0 and 1. In each group, I need to find the first occurrence of 1 and change all the remaining occurrences to 0. Is there a way to achieve this in dplyr?

For example, let's take the "iris" data and add an extra "tag" column:

library(dplyr)  # needed for %>% and group_by()

data(iris)
set.seed(1)
iris$tag <- sample(c(0, 1), 150, replace = TRUE, prob = c(0.8, 0.2))
giris <- iris %>% group_by(Species)

In "giris", in the "setosa" group, I need to keep only the first occurrence of 1 (i.e. in the 4th row) and set the remaining ones to 0. This seems a bit like applying a mask.

Is there a way to do it? I have been experimenting with "which" and "duplicated", but without success. I also thought about filtering out the rows with 1, keeping the first per group, and joining the result back to the remaining rows, but this seems awkward, especially for a 12GB data set.

rpl asked Mar 18 '16 08:03

2 Answers

A dplyr option:

mutate(giris, newcol = as.integer(tag & cumsum(tag) == 1))

Or

mutate(giris, newcol = as.integer(tag & !duplicated(tag)))

Or, using data.table, the same approach, but modifying by reference:

library(data.table)
setDT(giris)
giris[, newcol := as.integer(tag & cumsum(tag) == 1), by = Species]
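
To see why these two masks pick out only the first 1, here is a minimal base R sketch on a toy vector. The vector `tag`, the grouping vector `g`, and the result names are made up for illustration and are not part of the question's data; `ave()` is used here only as a base R stand-in for the grouped `mutate()`.

```r
# Toy vector standing in for one group's `tag` column (hypothetical data).
tag <- c(0, 1, 0, 1, 1, 0)

# cumsum(tag) counts the 1s seen so far. It equals 1 from the first 1 up to
# (but not including) the second 1, so `tag & ...` is TRUE only at the first 1.
first_by_cumsum <- as.integer(tag & cumsum(tag) == 1)

# duplicated(tag) is FALSE only at the first 0 and the first 1;
# `tag &` then drops the first 0.
first_by_duplicated <- as.integer(tag & !duplicated(tag))

print(cumsum(tag))          # 0 1 1 2 3 3
print(first_by_cumsum)      # 0 1 0 0 0 0
print(first_by_duplicated)  # 0 1 0 0 0 0

# The same mask applied per group with base R's ave(); `g` is a
# hypothetical grouping vector playing the role of Species.
g <- c("a", "a", "a", "b", "b", "b")
first_per_group <- ave(tag, g, FUN = function(v) as.integer(v & cumsum(v) == 1))
print(first_per_group)      # 0 1 0 1 0 0
```

Note that `==` binds more tightly than `&` in R, so `tag & cumsum(tag) == 1` is read as `tag & (cumsum(tag) == 1)`.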
talat answered Oct 17 '22 02:10


We can try

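# cumsum(c(TRUE, diff(tag) < 0)) stays at 1 until tag first drops from 1 to 0;
# every value after that drop is set to 0.
# Caveat: if the first 1 starts a run of consecutive 1s, the whole run is kept.
res <- giris %>%
  group_by(Species) %>%
  mutate(tag1 = ifelse(cumsum(c(TRUE, diff(tag) < 0)) != 1, 0, tag))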

table(res[c("Species", "tag1")])
#            tag1
#Species      0  1
# setosa     49  1
# versicolor 49  1
# virginica  49  1
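
The `diff()`-based expression zeroes everything after the first drop from 1 to 0, so its result depends on whether the first 1 is isolated. A small base R sketch on made-up vectors (the helper name `first_until_drop` and both toy vectors are hypothetical) shows the difference from the `cumsum(tag) == 1` mask:

```r
# Hypothetical helper wrapping the expression used in the mutate() call.
first_until_drop <- function(tag) {
  ifelse(cumsum(c(TRUE, diff(tag) < 0)) != 1, 0, tag)
}

isolated <- c(0, 1, 0, 1, 0)  # the first 1 is followed directly by a 0
run      <- c(0, 1, 1, 0, 1)  # the first 1 starts a run of two 1s

print(first_until_drop(isolated))          # 0 1 0 0 0
print(first_until_drop(run))               # 0 1 1 0 0  (both 1s of the run survive)
print(as.integer(run & cumsum(run) == 1))  # 0 1 0 0 0  (only the first 1 survives)
```

If the data can contain consecutive 1s at the start of a group, the `cumsum(tag) == 1` mask appears to be the safer choice.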
akrun answered Oct 17 '22 04:10