
How to strip down comma-separated strings to unique substrings

Tags:

regex

r

I'm struggling to strip down comma-separated strings to their unique substrings in a clean fashion:

x <- c("Anna & x, Anna & x",
       "Alb, Berta 222, Alb", 
       "Al Pacino", 
       "Abb cd xy, Abb cd xy, C123, C123, B")

I seem to be doing fine with this combination of a negated character class, a negative lookahead, and a backreference; what bothers me, however, is that many of the extracted substrings carry unwanted leading whitespace:

library(stringr)
str_extract_all(x, "([^,]+)(?!.*\\1)")
[[1]]
[1] " Anna & x"

[[2]]
[1] " Berta 222" " Alb"      

[[3]]
[1] "Al Pacino"

[[4]]
[1] " Abb cd xy" " C123"      " B"

How can the pattern be refined so that no unwanted whitespace gets extracted?

Desired result:
#> [[1]]
#> [1] "Anna & x"
#> [[2]]
#> [1] "Alb"       "Berta 222"
#> [[3]]
#> [1] "Al Pacino"
#> [[4]]
#> [1] "Abb cd xy" "C123"      "B"

EDIT:

Just wanted to share this solution with double negative lookahead, which also works well (and thanks for the many useful solutions proposed!)

str_extract_all(x, "((?!\\s)[^,]+)(?!.*\\1)")
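
For readers without stringr, the same pattern can be tried in base R. This is only a sketch, assuming `perl = TRUE` (PCRE) so that the lookaheads and the backreference are supported; the names `pattern` and `extracted` are illustrative and not from the original post.

```r
# Input vector from the question
x <- c("Anna & x, Anna & x",
       "Alb, Berta 222, Alb",
       "Al Pacino",
       "Abb cd xy, Abb cd xy, C123, C123, B")

# (?!\\s)    : a match must not start on whitespace
# [^,]+      : capture a run of non-comma characters
# (?!.*\\1)  : reject the capture if the same text occurs again later
pattern <- "((?!\\s)[^,]+)(?!.*\\1)"

# gregexpr() locates all matches per string; regmatches() pulls them out
extracted <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
extracted
```

Because the trailing lookahead discards every occurrence except the last, the second element comes out as "Berta 222" "Alb", which is the same set as in the desired result but in a different order.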
Chris Ruehlemann asked Nov 30 '22 13:11


1 Answer

You can use str_split to get the individual substrings, followed by unique to remove repeated strings. For example:

library(tidyverse)

str_split(x, ", ?") %>% map(unique)

[[1]]
[1] "Anna & x"

[[2]]
[1] "Alb"       "Berta 222"

[[3]]
[1] "Al Pacino"

[[4]]
[1] "Abb cd xy" "C123"      "B"

If you want the output as a single vector of unique strings, you could do:

unique(unlist(str_split(x, ", ?")))

[1] "Anna & x" "Alb" "Berta 222" "Al Pacino" "Abb cd xy" "C123" "B"

In the code above we used the regex ", ?" to split at a comma optionally followed by a single space, so that we don't end up with a leading space. For future reference, if you do need to get rid of leading or trailing whitespace, you can use str_trim. For example, if we had split on "," alone in str_split, we could do the following:

str_split(x, ",") %>% 
  map(str_trim) %>% 
  map(unique)
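
The same split, trim, and deduplicate steps can also be sketched without the tidyverse. This is a minimal base R variant, not part of the original answer; it assumes R 3.2.0 or later for `trimws()`, and `parts` and `result` are illustrative names.

```r
# Input vector from the question
x <- c("Anna & x, Anna & x",
       "Alb, Berta 222, Alb",
       "Al Pacino",
       "Abb cd xy, Abb cd xy, C123, C123, B")

# Split each string at a comma plus any whitespace that follows it
parts <- strsplit(x, ",\\s*")

# Trim any remaining outer whitespace, then keep the first occurrence of each item
result <- lapply(parts, function(p) unique(trimws(p)))
result
```

Since `unique()` keeps the first occurrence of each item, the result preserves the original left-to-right order, matching the desired output in the question.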
eipi10 answered Dec 05 '22 01:12