Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove only the rows failed a test inside a group where a row passed the test

Tags:

r

dplyr

tibble

Background: I am using R to help me figure out which entries in file A (expected names) map to which entries in file B (possible names). Unfortunately, file B has multiple entries with the same ID and different names. Sometimes these names are exact matches of the ID in file A, but other times they are close or reworded. For now, I am finding those direct 1:1 matches, and removing the non-matches with the same ID.

Question: Is there a cleaner way to remove known non-matches inside of groups which contain 1:1 matches, while not removing non-matches in other groups?

Example data:

example <- tribble(
  ~id, ~expect, ~possible, ~status,
  1, "box of forks", "spoon drawer", "review",
  1, "box of forks", "box of forks", "match",
  1, "box of forks", "cheese knife", "review",
  2, "dish washer", "dish washing machine", "review",
  2, "dish washer", "oven", "review",
  2, "dish washer", "microwave", "review",
)

In this example, row 2 is a 1:1 match, so it was given the status "match". I, definitely a human, know that row 4 is also a match, because I can read and understand the data set. Since I expect there will be cases like this, I can't simply do filter(status == "match"). However, before I get into reading the file, I want to remove the rows that I now know are going to be junk. As such, I have figured out that I can do:

example %>% 
  group_by(id) %>% 
  mutate(matches = case_when(
    status == "review" ~ 0,
    status == "match" ~ 1,
  ),
  total = sum(matches)
  ) %>% 
  filter(
    !(matches == 0 & total > 0)
  )

Which gives the correct/expected result:

     id expect       possible             status matches total
  <dbl> <chr>        <chr>                <chr>    <dbl> <dbl>
1     1 box of forks box of forks         match        1     1
2     2 dish washer  dish washing machine review       0     0
3     2 dish washer  oven                 review       0     0
4     2 dish washer  microwave            review       0     0

While this works, it looks clunky and makes me sad. Is there a cleaner way of removing entries that are still "review" and exist inside of groups which contain at least one row with "match"?

like image 972
David Robie Avatar asked Oct 28 '25 02:10

David Robie


2 Answers

Maybe just use all.

example %>%
  filter(all(status=="review") | status=="match", .by=id)

# A tibble: 4 × 4
     id expect       possible             status
  <dbl> <chr>        <chr>                <chr> 
1     1 box of forks box of forks         match 
2     2 dish washer  dish washing machine review
3     2 dish washer  oven                 review
4     2 dish washer  microwave            review
like image 93
Edward Avatar answered Oct 30 '25 17:10

Edward


This will keep either matches or all observations for the id where there are no matches:

example |>
  filter(status == "match" | 
         !any(status == "match"), .by = id)

Result

     id expect       possible             status
  <dbl> <chr>        <chr>                <chr> 
1     1 box of forks box of forks         match 
2     2 dish washer  dish washing machine review
3     2 dish washer  oven                 review
4     2 dish washer  microwave            review
like image 28
Jon Spring Avatar answered Oct 30 '25 17:10

Jon Spring



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!