Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compare strings in dplyr

Tags:

r

dplyr

tidyverse

My dataframe looks like:

df <- tibble::tribble(
  ~order_id, ~user_id,      ~comp_order,           ~comp_rec,
    1164320,    32924, "4-6-22-11-37-5", "4-5-6-11-22-36-37",
    1169182,    33128,  "9-4-15-28-8-7",  "4-7-8-9-28-37-38",
    1166014,    33003,    "27-22-4-6-5", "4-5-6-22-27-36-37",
    1166019,    32996,    "27-22-4-6-8", "4-6-8-22-27-36-38"
  )

I want to know what digit is present in the comp_order column and not in the comp_rec.

The final output should look like:

  order_id user_id comp_rec          comp_order     is_equal elements_removed_from_rec elements_added_to_order
     <dbl>   <dbl> <chr>             <chr>          <chr>    <chr>                     <chr>                  
1  1164320   32924 4-5-6-11-22-36-37 4-6-22-11-37-5 no       36                        none                   
2  1169182   33128 4-7-8-9-28-37-38  9-4-15-28-8-7  no       37                        none                   
3  1166014   33003 4-5-6-22-27-36-37 27-22-4-6-5    no       36-37                     none                   
4  1166019   32996 4-6-8-22-27-36-38 27-22-4-6-8    no       36-38                     none                   
5  1166012   32922 27-22-4-6-8       27-22-4-6-8    yes      none                      none                   
6  1166033   32911 27-22-4-6-8       27-22-4-6-8-33 no       none                      33                     

df_output <- tibble::tribble(
               ~order_id, ~user_id,           ~comp_rec,      ~comp_order, ~is_equal, ~elements_removed_from_rec, ~elements_added_to_order,
                 1164320,    32924, "4-5-6-11-22-36-37", "4-6-22-11-37-5",      "no",                       "36",                   "none",
                 1169182,    33128,  "4-7-8-9-28-37-38",  "9-4-15-28-8-7",      "no",                       "37",                   "none",
                 1166014,    33003, "4-5-6-22-27-36-37",    "27-22-4-6-5",      "no",                    "36-37",                   "none",
                 1166019,    32996, "4-6-8-22-27-36-38",    "27-22-4-6-8",      "no",                    "36-38",                   "none",
                 1166012,    32922,       "27-22-4-6-8",    "27-22-4-6-8",     "yes",                     "none",                   "none",
                 1166033,    32911,       "27-22-4-6-8", "27-22-4-6-8-33",      "no",                     "none",                     "33"
               )

I would need to know:

  1. what has been removed from rec
  2. what has been added to order

In terms of digits in the string.

The issues that the order of the digits in the string is not necessarily the same...

How can I make a comparison between these 2 strings?

like image 557
chopin_is_the_best Avatar asked Sep 11 '25 17:09

chopin_is_the_best


2 Answers

Here's an approach with purrr:

First, we use purrr:map to split the elements on - into a list of elements. Then, we use purrr:map2 to perform setdiff on the lists to identify the different elements.

If both evaluate to "", then we know they are the same, and so we can use case_when to determine the is_equal column.

Then we can clean up by removing the list columns.

library(dplyr)
library(purrr)
df %>%
  mutate(comp_order_list = map(comp_order, ~str_split(.,"-", simplify = TRUE)),
         comp_rec_list = map(comp_rec, ~str_split(.,"-", simplify = TRUE)),
         elements_removed = map2_chr(comp_rec_list,comp_order_list,
                                   ~ paste(setdiff(.x,.y),collapse = "-")),
         elements_added = map2_chr(comp_order_list,comp_rec_list,
                                     ~ paste(setdiff(.x,.y),collapse = "-")),
         is_equal = case_when(elements_removed == "" & elements_added == "" ~ "yes",
                              TRUE ~ "no")) %>%
  dplyr::select(order_id,user_id,comp_rec,comp_order,is_equal,elements_removed,elements_added)
# A tibble: 6 x 7
  order_id user_id comp_rec          comp_order     is_equal elements_removed elements_added
     <dbl>   <dbl> <chr>             <chr>          <chr>    <chr>            <chr>         
1  1164320   32924 4-5-6-11-22-36-37 4-6-22-11-37-5 no       "36"             ""            
2  1169182   33128 4-7-8-9-28-37-38  9-4-15-28-8-7  no       "37-38"          "15"          
3  1166014   33003 4-5-6-22-27-36-37 27-22-4-6-5    no       "36-37"          ""            
4  1166019   32996 4-6-8-22-27-36-38 27-22-4-6-8    no       "36-38"          ""            
5  1166012   32922 27-22-4-6-8       27-22-4-6-8    yes      ""               ""            
6  1166033   32911 27-22-4-6-8       27-22-4-6-8-33 no       ""               "33"    
like image 182
Ian Campbell Avatar answered Sep 14 '25 06:09

Ian Campbell


Here one way of doing it that utilises setdiff at its core:

# These two lines split the strings and store them in one cell in the dataframe
df$digits_order <- str_split(df$comp_order, "-")
df$digits_rec <- str_split(df$comp_rec, "-")

# The following apply function iterates over rows and applies setdiff. 
# paste is used to stitch them together if apply returns multiple digits.  
df$in_rec_but_not_order <- apply(df, 1, function(row) paste(setdiff(row$digits_rec, row$digits_order), collapse = "-"))
df$in_order_but_not_rec <- apply(df, 1, function(row) paste(setdiff(row$digits_order, row$digits_rec), collapse = "-"))

df
# A tibble: 4 x 8
  order_id user_id comp_order     comp_rec          digits_order digits_rec in_order_but_not_rec in_rec_but_not_order
     <dbl>   <dbl> <chr>          <chr>             <list>       <list>     <chr>                <chr>               
1  1164320   32924 4-6-22-11-37-5 4-5-6-11-22-36-37 <chr [6]>    <chr [7]>  ""                   36                  
2  1169182   33128 9-4-15-28-8-7  4-7-8-9-28-37-38  <chr [6]>    <chr [7]>  15                   37-38               
3  1166014   33003 27-22-4-6-5    4-5-6-22-27-36-37 <chr [5]>    <chr [7]>  ""                   36-37               
4  1166019   32996 27-22-4-6-8    4-6-8-22-27-36-38 <chr [5]>    <chr [7]>  ""                   36-38               

Here, the last column contains the digits missing from the record.

like image 27
coffeinjunky Avatar answered Sep 14 '25 08:09

coffeinjunky