I would like to split expressions that contain mathematical comparisons, e.g.
unlist(strsplit("var<3", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var==5", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var>2", "(?=[=<>])", perl = TRUE))
The results are:
[1] "var" "<" "3"
[1] "var" "=" "=" "5"
[1] "var" ">" "2"
For the 2nd example above, I would like to get [1] "var" "==" "5", so the two = should be returned as a single element. How do I need to change my regular expression to achieve this? (I already tried grouping and quantifiers for "==", but nothing worked - regular expressions are not my friends...)
You may use a PCRE regex to match the substrings you need:
==|[<>]|(?:(?!==)[^<>])+
To also support !=, modify it as
[!=]=|[<>]|(?:(?![=!]=)[^<>])+
See the regex demo.
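As a quick check of the != variant, here is a minimal sketch that applies it to a few inputs in the style of the question. The names `pattern`, `inputs` and `tokens` are only illustrative and do not come from the original post.

```r
# Pattern from above: "==" or "!=", or a single "<" / ">",
# or a run of other characters that does not start "==" or "!=".
pattern <- "[!=]=|[<>]|(?:(?![=!]=)[^<>])+"

inputs <- c("var!=5", "var==5", "var<3")

# gregexpr() finds every match; regmatches() extracts them per input string.
tokens <- regmatches(inputs, gregexpr(pattern, inputs, perl = TRUE))

tokens[[1]]  # "var" "!=" "5"
tokens[[2]]  # "var" "==" "5"
tokens[[3]]  # "var" "<"  "3"
```

Note that this matches the tokens instead of splitting on a delimiter, which is why the two-character operators stay in one piece.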
Details:
- == - 2 = signs
- | - or
- [<>] - a < or >
- | - or
- (?:(?!==)[^<>])+ - 1 or more chars other than < and > ([^<>]) that do not start a == char sequence (a tempered greedy token).
NOTE: This is easily expandable by adding more alternatives and adjusting the tempered greedy token.
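To illustrate the kind of expansion the note describes, here is one possible sketch that also keeps <= and >= together. This extended pattern is my own assumption of how such an extension could look, not something given in the answer, and the variable names are illustrative.

```r
# Assumed extension: the first alternative now accepts "<=", ">=", "==" and "!=".
# The tempered token only needs to guard against "==" and "!=", because
# "<" and ">" are already excluded by [^<>].
pattern_ext <- "[!=<>]=|[<>]|(?:(?![!=]=)[^<>])+"

text_ext <- "a>=b!=c==d<e"
res_ext <- regmatches(text_ext, gregexpr(pattern_ext, text_ext, perl = TRUE))[[1]]

res_ext  # "a" ">=" "b" "!=" "c" "==" "d" "<" "e"
```

The two-character operators are listed before [<>] so that the alternation tries them first.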
R test:
> text <- "Text1==text2<text3><More here"
> res <- regmatches(text, gregexpr("==|[<>]|(?:(?!==)[^<>])+", text, perl=TRUE))
> res
[[1]]
[1] "Text1" "==" "text2" "<" "text3" ">"
[7] "<" "More here"
Expanding from my idea in comments, just for the formatting:
tests=c("var==5","var<3","var.name>5")
regmatches(tests,regexec("([a-zA-Z0-9_.]+)(\\W+)([a-zA-Z0-9_.]+)",tests))
\w is [a-zA-Z0-9_] and \W is its opposite, [^a-zA-Z0-9_]. After a comment I expanded the class to include ., and I wrote it out explicitly because R doesn't support \w inside a character class in base regex (you would need perl=TRUE).
So the regex searches for at least 1 of \w and ., then at least 1 character not in \w (to match operators), and then at least 1 of \w and dot.
Each step is captured, and this gives:
[[1]]
[1] "var==5" "var" "==" "5"
[[2]]
[1] "var<3" "var" "<" "3"
[[3]]
[1] "var.name>5" "var.name" ">" "5"
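If it helps to work with the pieces afterwards, the list above can be bound into a character matrix. This is only a sketch; the column names "match", "lhs", "op" and "rhs" are labels I made up for illustration.

```r
tests <- c("var==5", "var<3", "var.name>5")
m <- regmatches(tests, regexec("([a-zA-Z0-9_.]+)(\\W+)([a-zA-Z0-9_.]+)", tests))

# Every element of m has 4 entries (full match + 3 capture groups),
# so the list can be stacked row-wise into a matrix.
parts <- do.call(rbind, m)
colnames(parts) <- c("match", "lhs", "op", "rhs")

parts[, "op"]   # "==" "<" ">"
parts[, "lhs"]  # "var" "var" "var.name"
```

This assumes every input matches the pattern; an input with no match would yield an empty element and break the row binding.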
You may add * between the capture groups if your entries could have spaces around the operator; if you don't, the operator capture will include them.
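The effect of spaces on the operator capture can be seen with the unchanged regex. As an alternative to editing the pattern, this sketch simply trims the captured pieces afterwards with trimws(); that workaround is my assumption, not part of the answer, and the variable names are illustrative.

```r
spaced <- c("var == 5", "var.name > 5")
m_sp <- regmatches(spaced, regexec("([a-zA-Z0-9_.]+)(\\W+)([a-zA-Z0-9_.]+)", spaced))

# \W+ also matches spaces, so the operator group swallows them.
m_sp[[1]][3]  # " == "

# Strip leading/trailing whitespace from every captured piece.
m_trim <- lapply(m_sp, trimws)
m_trim[[1]]   # "var == 5" "var" "==" "5"
```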