
Group by using str_detect for groups with similar strings

Tags: r, dplyr

Consider this example data:

library(tidyverse)

dt <- tibble(Poison = c('Arsenic', 'Arsenic in Wine', 'Cyanide', 'Cyanide and Sugar'),
             Result = c('Death', 'Death With Class', 'Death', 'Death'))

I want to create a column that gives each group an identification number. However, I want the poisons to be grouped by string detection, i.e., 'Arsenic' and 'Arsenic in Wine' should be one group and 'Cyanide' and 'Cyanide and Sugar' another. Currently, R treats each poison as its own group, like so:

dt <- dt %>%
  group_by(Poison) %>%
  mutate(Group = n())
# A tibble: 4 × 3
# Groups:   Poison [4]
  Poison            Result           Group
  <chr>             <chr>            <int>
1 Arsenic           Death                1
2 Arsenic in Wine   Death With Class     1
3 Cyanide           Death                1
4 Cyanide and Sugar Death                1

I want 'Arsenic' and 'Arsenic in Wine' to be Group 1, and 'Cyanide' and 'Cyanide and Sugar' to be Group 2. Any ideas?

ksinva asked Dec 06 '25 07:12


2 Answers

If we know ahead of time a vector of "shortest patterns",

vec <- c("Arsenic", "Cyanide")
### or perhaps this for an automated approach
### (keeps only the first word of each name, so it assumes the group is the first word)
vec <- unique(sub(" .*", "", dt$Poison))

then we can do:

dt |>
  mutate(grp = apply(sapply(vec, grepl, Poison), 1, function(z) which(z)[1]))
# # A tibble: 4 × 3
#   Poison            Result             grp
#   <chr>             <chr>            <int>
# 1 Arsenic           Death                1
# 2 Arsenic in Wine   Death With Class     1
# 3 Cyanide           Death                2
# 4 Cyanide and Sugar Death                2
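
To see what the one-liner above is doing, here is a minimal base-R sketch that unpacks it step by step. The names `poison`, `hits`, and `grp` are made up for illustration, and `fixed = TRUE` is an added assumption (it treats each pattern as a literal string rather than a regular expression), not part of the code above.

```r
# Example data mirroring the question (plain vectors instead of a tibble).
poison <- c("Arsenic", "Arsenic in Wine", "Cyanide", "Cyanide and Sugar")
vec <- c("Arsenic", "Cyanide")

# One row per poison, one column per pattern; TRUE where the pattern occurs.
# fixed = TRUE is an assumption here: match patterns literally, not as regex.
hits <- sapply(vec, grepl, x = poison, fixed = TRUE)
print(hits)

# For each row, take the index of the first matching pattern.
# which(z)[1] is NA when a row matches no pattern at all.
grp <- apply(hits, 1, function(z) which(z)[1])
print(grp)
```

A poison that matches none of the patterns gets `NA` rather than a group number, which is probably what you want to notice rather than silently misassign.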
r2evans answered Dec 08 '25 22:12


Use this code with caution: it suits small to medium datasets only. adist() compares every element of the Poison column with every other element, so it builds an n-by-n matrix, which won't work for a huge dataset.

dt %>%
  mutate(group = (!adist(Poison, partial = TRUE)) %>%
           igraph::graph_from_adjacency_matrix()%>%
           igraph::components()%>%
           getElement('membership'))

# A tibble: 4 × 3
  Poison            Result           group
  <chr>             <chr>            <dbl>
1 Arsenic           Death                1
2 Arsenic in Wine   Death With Class     1
3 Cyanide           Death                2
4 Cyanide and Sugar Death                2
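
As a rough illustration of the first step in that pipeline, the sketch below shows only the matrix that gets handed to igraph (the igraph calls themselves are left out). The names `poison`, `d`, and `adj` are hypothetical.

```r
# Example data mirroring the question.
poison <- c("Arsenic", "Arsenic in Wine", "Cyanide", "Cyanide and Sugar")

# With partial = TRUE, d[i, j] is the edit distance from poison[i] to its
# best-matching substring of poison[j]; 0 means poison[i] occurs in poison[j].
d <- utils::adist(poison, partial = TRUE)

# Negating maps zero distances to TRUE, giving a logical adjacency matrix.
adj <- !d
dimnames(adj) <- list(poison, poison)
print(adj)
```

Note that the matrix is not symmetric: 'Arsenic' is found inside 'Arsenic in Wine', but not the other way round. The grouping still works because a one-directional link is enough to put both strings in the same connected component.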

If you already have the vector of needed groups, you could do:

vec <- c("Arsenic", "Cyanide")
transform(dt, group = max.col(-t(adist(vec, Poison, partial = TRUE))))

             Poison           Result group
1           Arsenic            Death     1
2   Arsenic in Wine Death With Class     1
3           Cyanide            Death     2
4 Cyanide and Sugar            Death     2
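
One caveat worth knowing about the snippet above: `max.col()` breaks ties at random by default, and it always returns some column, even for a row that matches no pattern well. The sketch below is a hedged variant that sets `ties.method = "first"` so the result is reproducible; the names `poison`, `d`, and `grp` are made up for illustration.

```r
# Example data mirroring the question.
poison <- c("Arsenic", "Arsenic in Wine", "Cyanide", "Cyanide and Sugar")
vec <- c("Arsenic", "Cyanide")

# Rows are poisons, columns are patterns; smaller distance = better match.
d <- t(utils::adist(vec, poison, partial = TRUE))

# Negate so the smallest distance becomes the row maximum.
# ties.method = "first" avoids the default random tie-breaking.
grp <- max.col(-d, ties.method = "first")
print(grp)
```

If some poisons might match none of the patterns, it may be worth checking the row minimum of `d` as well, since a nonzero minimum means the assigned group is only the closest guess.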
KU99 answered Dec 08 '25 21:12


