I have a grouped data_frame with a "tag" column taking on values "0" and "1". In each group, I need to find the first occurrence of "1" and change all the remaining occurrences to "0". Is there a way to achieve it in dplyr?
For example, let's take "iris" data and let's add the extra "tag" column:
library(dplyr)
data(iris)
set.seed(1)
iris$tag <- sample( c(0, 1), 150, replace = TRUE, prob = c(0.8, 0.2))
giris <- iris %>% group_by(Species)
In "giris", in the "setosa" group, I need to keep only the first occurrence of "1" (i.e. in the 4th row) and set the remaining ones to "0". This seems a bit like applying a mask or something...
Is there a way to do it? I have been experimenting with "which" and "duplicated" but I did not succeed. I have been thinking about filtering the "1"s only, keeping them, then joining with the remaining set, but this seems awkward, especially for a 12GB data set.
A dplyr option:
mutate(giris, newcol = as.integer(tag & cumsum(tag) == 1))
Or
mutate(giris, newcol = as.integer(tag & !duplicated(tag)))
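To see why both expressions keep only the first 1, here is a minimal base R sketch on a single toy vector. The vector `tag` below is made up for illustration and stands in for one group's column; it is not taken from the iris example.

```r
# Hypothetical toy vector standing in for one group's tag column
tag <- c(0, 0, 1, 1, 0, 1)

# cumsum(tag) counts the 1s seen so far; it equals 1 from the first 1
# up to (but not including) the second 1, so `tag &` keeps only the first 1
first_by_cumsum <- as.integer(tag & cumsum(tag) == 1)

# duplicated(tag) is FALSE only for the first 0 and the first 1;
# `tag &` then discards the first 0
first_by_dup <- as.integer(tag & !duplicated(tag))

print(first_by_cumsum)
print(first_by_dup)
```

Inside a grouped `mutate()` the same expression is evaluated once per group, so each group gets its own running count.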
Or using data.table, same approach, but modify by reference:
library(data.table)
setDT(giris)
giris[, newcol := as.integer(tag & cumsum(tag) == 1), by = Species]
We can try
res <- giris %>%
  group_by(Species) %>%
  # cumsum(tag) > 1 is TRUE from the second 1 onward,
  # so only the first 1 in each group is kept
  mutate(tag1 = ifelse(cumsum(tag) > 1, 0, tag))
table(res[c("Species", "tag1")])
#             tag1
# Species       0  1
#   setosa     49  1
#   versicolor 49  1
#   virginica  49  1
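The same per-group logic can also be sketched in base R with `ave()` instead of dplyr, along with a cheap check that no group ends up with more than one 1. The data frame `df` and its columns `grp` and `tag` are made up for illustration.

```r
# Hypothetical toy data: two groups, each containing several 1s
df <- data.frame(
  grp = c("a", "a", "a", "a", "b", "b", "b"),
  tag = c(0, 1, 1, 0, 1, 0, 1)
)

# ave() applies FUN within each level of grp and returns a vector
# aligned with the original row order
df$newcol <- ave(df$tag, df$grp,
                 FUN = function(x) as.integer(x == 1 & cumsum(x) == 1))

# Sanity check: no group should contain more than one 1
ones_per_group <- tapply(df$newcol, df$grp, sum)
stopifnot(all(ones_per_group <= 1))

print(df)
```

A check like the `tapply()` line is worth running on real data, since a `table()` that looks right on one sample does not prove the rule holds for every arrangement of 1s.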