
Is there an "unfilter" in dplyr to merge changes with the original dataset?

Tags: r, dplyr

Let's say I have two data.frames like so:

bad_ids = read.table(text="id n
123 3", header = T)

dat <- read.table(text="id n partner_id
123 3 555
123 3 345
123 3 092
245 1 438
888 1 333", header=T)

I want to identify all the rows in dat that match the id column in bad_ids. I then want to create a "flag" variable that is set to 1 for all but the first match. The resulting data.frame would look like:

dat <- read.table(text="id n partner_id flag 
123 3 555 0
123 3 345 1
123 3 092 1
245 1 438 0
888 1 333 0", header=T)

Notice that the first row of 123 has a flag of 0. I want to flag all but the first match.

My strategy for emulating this behavior was something like the following:

# Flag the duplicate rows
dat %>% 
  filter(id %in% bad_ids$id) %>%
  slice(-1) %>% # drop the first matching row
  mutate(flag = 1) %>% # set the flag on all but the first match
  unfilter() # hypothetical function: go back to the original, unfiltered dataset

I'm wondering if there's some equivalent of "unfilter" that allows me to re-merge with the original dataset?

Parseltongue asked Nov 12 '19 20:11



1 Answer

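One option is to create 'flag' as a logical vector with %in%, comparing 'id' against the 'id' column of 'bad_ids'. Then, grouped by 'id', update 'flag' with a second condition based on row_number(), so that only the rows after the first one in each group stay flagged. The unary + converts the logical result to integer 0/1.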

library(dplyr)
dat %>% 
   mutate(flag = id %in% bad_ids$id) %>% 
   group_by(id) %>% 
   mutate(flag = +(row_number() > 1 & flag))
   #or use `duplicated`
   # mutate(flag = +(duplicated(flag) & flag))
# A tibble: 5 x 4
# Groups:   id [3]
#     id     n partner_id  flag
#  <int> <int>      <int> <int>
#1   123     3        555     0
#2   123     3        345     1
#3   123     3         92     1
#4   245     1        438     0
#5   888     1        333     0
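
The same idea can be sketched without grouping, since duplicated() already marks every occurrence of an id after its first. This is only a minimal variant of the approach above, using the question's data; the name `flagged` is introduced here for illustration.

```r
library(dplyr)

# Data from the question
bad_ids <- read.table(text = "id n
123 3", header = TRUE)

dat <- read.table(text = "id n partner_id
123 3 555
123 3 345
123 3 092
245 1 438
888 1 333", header = TRUE)

# Flag a row when its id is a bad id AND it is not the first
# occurrence of that id; all rows of dat are kept, so nothing
# needs to be "unfiltered" afterwards.
flagged <- dat %>%
  mutate(flag = as.integer(id %in% bad_ids$id & duplicated(id)))

flagged
```

Because no rows are ever dropped, the result keeps the original row order and is not grouped.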

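Alternatively, keeping the approach from the OP's code, we can join the flagged rows back to the original data and then replace the resulting NA values with 0. Note that replace_na() comes from tidyr, and that right_join() without a `by` argument joins on all columns the two data frames share.

library(dplyr)
library(tidyr) # for replace_na()
dat %>% 
  filter(id %in% bad_ids$id) %>%
  slice(-1) %>% # drop the first matching row
  mutate(flag = 1) %>% 
  right_join(dat) %>% # bring back the rows that were filtered out
  mutate(flag = replace_na(flag, 0))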
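
One caveat with this pattern: an ungrouped slice(-1) drops only the very first row of the filtered data, so with more than one bad id only the first id would keep an unflagged row. A possible sketch that handles several bad ids, and keeps the original row order by using left_join() from the original data, is below. The example data and the name `flagged_rows` are made up for illustration, and it assumes the join columns identify each row uniquely.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with two bad ids (not from the question)
bad_ids <- data.frame(id = c(123L, 245L))
dat <- data.frame(
  id         = c(123L, 123L, 123L, 245L, 245L, 888L),
  partner_id = c(555L, 345L, 92L, 438L, 439L, 333L)
)

# Rows to flag: every match except the first one within each id
flagged_rows <- dat %>%
  filter(id %in% bad_ids$id) %>%
  group_by(id) %>%
  slice(-1) %>%      # drops the first row of each group
  ungroup() %>%
  mutate(flag = 1)

# "Unfilter": join back onto the full data; unmatched rows get NA, then 0
result <- dat %>%
  left_join(flagged_rows, by = c("id", "partner_id")) %>%
  mutate(flag = replace_na(flag, 0))

result
```

Starting from `dat` on the left side of the join means the output has the same rows in the same order as the original data.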
akrun answered Oct 11 '22 19:10