I'm struggling to strip down comma-separated strings to their unique substrings in a clean fashion:
x <- c("Anna & x, Anna & x", #
"Alb, Berta 222, Alb",
"Al Pacino",
"Abb cd xy, Abb cd xy, C123, C123, B")
I seem to be doing fine with this combination of a negated character class, a negative lookahead, and a backreference; what bothers me, however, is that many of the extracted substrings carry unwanted leading whitespace:
library(stringr)
str_extract_all(x, "([^,]+)(?!.*\\1)")
[[1]]
[1] " Anna & x"
[[2]]
[1] " Berta 222" " Alb"
[[3]]
[1] "Al Pacino"
[[4]]
[1] " Abb cd xy" " C123" " B"
How can the pattern be refined so that no unwanted whitespace gets extracted?
Desired result:
#> [[1]]
#> [1] "Anna & x"
#> [[2]]
#> [1] "Alb" "Berta 222"
#> [[3]]
#> [1] "Al Pacino"
#> [[4]]
#> [1] "Abb cd xy" "C123" "B"
EDIT:
Just wanted to share this solution, which uses two negative lookaheads and also works well (and thanks for the many useful solutions proposed!):
str_extract_all(x, "((?!\\s)[^,]+)(?!.*\\1)")
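Two things about this pattern seem worth checking before relying on it. The sketch below is only an illustration: it runs the same pattern through base R's PCRE engine (perl = TRUE) instead of stringr, and extract_last_unique is a hypothetical helper name. As far as I can tell, the pattern keeps the last occurrence of each item, so the order can differ from the desired result, and the backreference matches the text anywhere later in the string, so an item that is a substring of a later item gets dropped.

```r
# Same pattern as in the edit above, run with base R (PCRE) instead of stringr
pattern <- "((?!\\s)[^,]+)(?!.*\\1)"

# Hypothetical helper, for illustration only
extract_last_unique <- function(s) {
  regmatches(s, gregexpr(pattern, s, perl = TRUE))
}

# Duplicates: the last "Alb" is kept, so it comes after "Berta 222"
res_dupes <- extract_last_unique("Alb, Berta 222, Alb")[[1]]
print(res_dupes)
#> [1] "Berta 222" "Alb"

# Substring case: "Al" is dropped because "Al" occurs again inside "Alb"
res_prefix <- extract_last_unique("Al, Alb")[[1]]
print(res_prefix)
#> [1] "Alb"
```

If the items can overlap like this, or the original order matters, the pattern would need further anchoring.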
You can use str_split() to get the individual substrings, followed by unique() to remove repeated strings. For example:
library(tidyverse)
str_split(x, ", ?") %>% map(unique)
[[1]]
[1] "Anna & x"
[[2]]
[1] "Alb" "Berta 222"
[[3]]
[1] "Al Pacino"
[[4]]
[1] "Abb cd xy" "C123" "B"
If you want the output as a single vector of unique strings, you could do:
unique(unlist(str_split(x, ", ?")))
[1] "Anna & x" "Alb" "Berta 222" "Al Pacino" "Abb cd xy" "C123" "B"
In the code above we used the regex ", ?" to split at a comma or at a comma followed by a space, so that we don't end up with a leading space. For future reference, if you do need to get rid of leading or trailing whitespace, you can use str_trim(). For example, if we had used "," in str_split(), we could do the following:
str_split(x, ",") %>%
map(str_trim) %>%
map(unique)
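For anyone who prefers to avoid the tidyverse dependency, the same split, trim, and deduplicate steps can be sketched in base R. This is a minimal sketch rather than part of the answer above; it assumes the items are separated by plain commas and uses only strsplit(), trimws(), and unique().

```r
x <- c("Anna & x, Anna & x",
       "Alb, Berta 222, Alb",
       "Al Pacino",
       "Abb cd xy, Abb cd xy, C123, C123, B")

# Split on a literal comma (fixed = TRUE avoids regex interpretation)
parts <- strsplit(x, ",", fixed = TRUE)

# Trim leading/trailing whitespace, then keep the first occurrence of each item
result <- lapply(parts, function(p) unique(trimws(p)))
print(result)
```

Unlike the lookahead pattern, unique() keeps the first occurrence of each item, which gives the order shown in the desired result.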