Group by using str_detect for groups with similar strings

Question

Consider this example data:

library(tidyverse)

dt <- tibble(Poison = c('Arsenic', 'Arsenic in Wine', 'Cyanide', 'Cyanide and Sugar'),
             Result = c('Death', 'Death With Class', 'Death', 'Death'))

I want to create a column that gives each group an identification number. However, I want the poisons to be grouped together by string detection, i.e., 'Arsenic' and 'Arsenic in Wine' should be one group and 'Cyanide' and 'Cyanide and Sugar' another. Currently, R treats each value as its own group:

dt <- dt %>%
  group_by(Poison) %>%
  mutate(Group = n())

# A tibble: 4 × 3
# Groups:   Poison [4]
  Poison            Result           Group
  <chr>             <chr>            <int>
1 Arsenic           Death                1
2 Arsenic in Wine   Death With Class     1
3 Cyanide           Death                1
4 Cyanide and Sugar Death                1

I want 'Arsenic' and 'Arsenic in Wine' to be Group 1, and 'Cyanide' and 'Cyanide and Sugar' to be Group 2. Any ideas?

r2evans · Accepted Answer

If we know ahead of time a vector of "shortest patterns",

vec <- c("Arsenic", "Cyanide")
### or perhaps this for an automated approach
vec <- unique(sub(" .*", "", dt$Poison))

then we can do:

dt |>
  mutate(grp = apply(sapply(vec, grepl, Poison), 1, function(z) which(z)[1]))
# # A tibble: 4 × 3
#   Poison            Result             grp
#   <chr>             <chr>            <int>
# 1 Arsenic           Death                1
# 2 Arsenic in Wine   Death With Class     1
# 3 Cyanide           Death                2
# 4 Cyanide and Sugar Death                2
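Since the question asks about str_detect, here is a minimal sketch of the same idea with stringr, split into two steps so the intermediate matrix can be inspected. It assumes the patterns in vec should be matched literally (hence fixed()); hits is a name introduced only for this illustration.

```r
library(dplyr)
library(stringr)

dt <- tibble(Poison = c('Arsenic', 'Arsenic in Wine', 'Cyanide', 'Cyanide and Sugar'),
             Result = c('Death', 'Death With Class', 'Death', 'Death'))
vec <- c("Arsenic", "Cyanide")

# One logical column per pattern: TRUE where that pattern occurs in Poison.
# fixed() treats the pattern as a literal string rather than a regex (an assumption here).
hits <- sapply(vec, function(p) str_detect(dt$Poison, fixed(p)))

# For each row, take the index of the first pattern that matched.
dt <- dt |>
  mutate(grp = apply(hits, 1, function(z) which(z)[1]))
```

A row that matches none of the patterns gets NA, because which(z)[1] on an all-FALSE row is NA.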

KU99 · Answer

Use this code with caution: it suits small to medium datasets only. adist builds an n-by-n matrix that compares every element of the Poison column with every other element, so it will not scale to a huge dataset.

dt %>%
  mutate(group = (!adist(Poison, partial = TRUE)) %>%
           igraph::graph_from_adjacency_matrix()%>%
           igraph::components()%>%
           getElement('membership'))

# A tibble: 4 × 3
  Poison            Result           group
  <chr>             <chr>            <dbl>
1 Arsenic           Death                1
2 Arsenic in Wine   Death With Class     1
3 Cyanide           Death                2
4 Cyanide and Sugar Death                2
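To see what the pipeline above builds on, the distance matrix can be inspected directly. This is a sketch in base R only; poison and d are names introduced for illustration, and y is passed explicitly for clarity. As I understand partial = TRUE, entry [i, j] is the edit distance from element i to the best-matching substring of element j, so a zero means element i occurs inside element j.

```r
poison <- c('Arsenic', 'Arsenic in Wine', 'Cyanide', 'Cyanide and Sugar')

# Partial (substring) edit distances between every pair of strings
d <- adist(poison, poison, partial = TRUE)
dimnames(d) <- list(poison, poison)
print(d)

# Negating turns zero distances into TRUE, i.e. "i is contained in j".
# This logical matrix is what the answer passes on as an adjacency matrix.
adj <- !d
print(adj)
```

The matrix is not symmetric: 'Arsenic' is contained in 'Arsenic in Wine', but not the other way round. The graph step then places two strings in the same group whenever they are linked in either direction.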

If you already have the vector of needed groups, you could do:

vec <- c("Arsenic", "Cyanide")
transform(dt, group = max.col(-t(adist(vec, Poison, partial = TRUE))))

             Poison           Result group
1           Arsenic            Death     1
2   Arsenic in Wine Death With Class     1
3           Cyanide            Death     2
4 Cyanide and Sugar            Death     2
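One detail worth knowing about this variant: max.col breaks ties at random by default. A minimal sketch that makes the result reproducible with ties.method = "first" follows; poison, d and group are illustrative names, and only base R is used.

```r
poison <- c('Arsenic', 'Arsenic in Wine', 'Cyanide', 'Cyanide and Sugar')
vec <- c("Arsenic", "Cyanide")

# Rows: patterns in vec; columns: poisons. Smaller means a closer match.
d <- adist(vec, poison, partial = TRUE)
dimnames(d) <- list(vec, poison)

# Transpose so each row is a poison, negate so the smallest distance
# becomes the largest value, then pick that column deterministically.
group <- max.col(-t(d), ties.method = "first")

data.frame(Poison = poison, group = group)
```

Unlike the grepl-based approach, this always returns a group: a poison that contains none of the patterns is assigned to the closest one instead of NA.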

Tags:

r

dplyr

ksinva
