I'm trying to split a string in R (using strsplit) at specific points (dashes, -), but not when the dashes are inside square brackets ([ ]).
Example:
xx <- c("Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
xx
[1] "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
[2] "Total Internet-Time Spent Online-Past 7 Days"
should give me something like:
list(c("Radio Stations","Listened to Past Week","Toronto [FM-CFXJ-93.5 (93.5 The Move)]"), c("Total Internet","Time Spent Online","Past 7 Days"))
[[1]]
[1] "Radio Stations" "Listened to Past Week"
[3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
[[2]]
[1] "Total Internet" "Time Spent Online" "Past 7 Days"
Is there a way to do this with a regular expression? The position and number of dashes vary across the elements of the vector, and brackets are not always present. However, when there are brackets, they are always at the end.
I've tried different things, but none are working:
## Trying to match "-" before "[" in Perl
strsplit(xx, split = "-(?=\\[)", perl=T)
# does nothing
## Trying to first extract what follows "[", then split what precedes it
temp <- strsplit(xx, "[", fixed = T)
temp <- lapply(temp, function(yy) substr(head(yy, -1),"-"))
# doesn't work as there are some elements with no brackets...
Any help would be appreciated.
Based on: Regex for matching a character, but not when it's enclosed in square bracket
You can use:
strsplit(xx, "-(?![^\\[]*\\])", perl = TRUE)
[[1]]
[1] "Radio Stations" "Listened to Past Week"
[3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
[[2]]
[1] "Total Internet" "Time Spent Online" "Past 7 Days"
To match a - that is not inside [ and ], you must first match any part of the string that is enclosed in [ and ] and discard it, then match - in all other contexts. In abc-def], the - is not between a [ and a ], so according to the requirements it should be split on.
It is done with this regex:
\[[^][]*](*SKIP)(*FAIL)|-
Here,
\[ - matches a [
[^][]* - zero or more chars other than [ and ] (if you use [^]] instead, it will match any char but ])
] - a literal ]
(*SKIP)(*FAIL) - PCRE verbs that discard the current match and make the engine go on looking for the next match after the end of the discarded text
| - or
- - a hyphen in any other context.

Or, to match [...[...] like substrings:
\[[^]]*](*SKIP)(*FAIL)|-
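As a rough sketch of how this variant differs from the first one, the snippet below runs both on a made-up string (the input and the variable names are assumptions for illustration, not from the question) where a dash sits between an outer [ and an inner [:

```r
# Hypothetical input: a dash between the outer "[" and an inner "["
x <- "a-[b-[c]-e"

# Variant 1: the bracketed part may contain neither "[" nor "]"
strict <- "\\[[^][]*](*SKIP)(*FAIL)|-"
# Variant 2: the bracketed part runs from "[" to the first "]"
loose  <- "\\[[^]]*](*SKIP)(*FAIL)|-"

res_strict <- strsplit(x, strict, perl = TRUE)[[1]]
res_loose  <- strsplit(x, loose,  perl = TRUE)[[1]]

res_strict
# [1] "a"   "[b"  "[c]" "e"
res_loose
# [1] "a"      "[b-[c]" "e"
```

With [^][]*, the outer [ never finds a matching ] before the inner [, so only [c] is protected and the dash after b is split on. With [^]]*, everything from the first [ up to the first ] is protected.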
Or, to account for nested square brackets:
(\[(?:[^][]++|(?1))*])(*SKIP)(*FAIL)|-
Here, (\[(?:[^][]++|(?1))*]) matches and captures a [, then zero or more repetitions of either 1+ chars other than [ and ] (with the possessive [^][]++) or (|) a recursion of the whole capturing group 1 pattern (with (?1), i.e. the whole part between (...)), and finally a closing ].
See the R demo:
xx <- c("abc-def]", "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
pattern <- "\\[[^][]*](*SKIP)(*FAIL)|-"
strsplit(xx, pattern, perl=TRUE)
# [[1]]
# [1] "abc" "def]"
# [[2]]
# [1] "Radio Stations"
# [2] "Listened to Past Week"
# [3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
# [[3]]
# [1] "Total Internet" "Time Spent Online" "Past 7 Days"
pattern_recursive <- "(\\[(?:[^][]++|(?1))*])(*SKIP)(*FAIL)|-"
xx2 <- c("Radio Stations-Listened to Past Week-Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
strsplit(xx2, pattern_recursive, perl=TRUE)
# [[1]]
# [1] "Radio Stations"
# [2] "Listened to Past Week"
# [3] "Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]"
# [[2]]
# [1] "Total Internet" "Time Spent Online" "Past 7 Days"
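For comparison, here is a small sketch that runs the lookahead pattern from the first answer and the (*SKIP)(*FAIL) pattern on the unbalanced abc-def] case (the variable names are only illustrative):

```r
x <- "abc-def]"

# Lookahead approach: refuses any "-" that is followed by a "]"
# with no "[" in between, even when no "[" was ever opened
lookahead <- "-(?![^\\[]*\\])"
# SKIP/FAIL approach: only protects text that is really enclosed in [ ... ]
skip_fail <- "\\[[^][]*](*SKIP)(*FAIL)|-"

res_lookahead <- strsplit(x, lookahead, perl = TRUE)[[1]]
res_skip_fail <- strsplit(x, skip_fail, perl = TRUE)[[1]]

res_lookahead
# [1] "abc-def]"
res_skip_fail
# [1] "abc"  "def]"
```

Since the question states that brackets, when present, are balanced and at the end, both patterns should give the same result on that kind of data; the difference only shows up on input with a stray ].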