Split character vector at math comparisons signs in R

Tags: regex, r

I would like to split expressions that contain mathematical comparisons, e.g.

unlist(strsplit("var<3", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var==5", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var>2", "(?=[=<>])", perl = TRUE))

The results are:

[1] "var" "<"   "3"  
[1] "var" "="   "="   "5"  
[1] "var" ">"   "2"

For the 2nd example above, I would like to get [1] "var" "==" "5", so the two = should be returned as a single element. How do I need to change my regular expression to achieve this? (I already tried grouping and quantifiers for "==", but nothing worked - regular expressions are not my friends...)


asked Nov 23 '16 08:11

Daniel

2 Answers

You may use a PCRE regex to match the substrings you need:

==|[<>]|(?:(?!==)[^<>])+

To also support !=, change it to:

[!=]=|[<>]|(?:(?![=!]=)[^<>])+

See the regex demo.

Details:

== - 2 = signs
| - or
[<>] - a < or >
| - or
(?:(?!==)[^<>])+ - 1 or more chars other than < and > ([^<>]) that do not start a == char sequence (a tempered greedy token).

NOTE: This is easily expandable by adding more alternatives and adjusting the tempered greedy token.

R test:

> text <- "Text1==text2<text3><More here"
> res <- regmatches(text, gregexpr("==|[<>]|(?:(?!==)[^<>])+", text, perl=TRUE))
> res
[[1]]
[1] "Text1"     "=="        "text2"     "<"         "text3"     ">"        
[7] "<"         "More here"
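As a rough sketch of how this could be applied to the inputs from the question, the pattern can be wrapped in a small helper. The function name split_comparison is made up for illustration, and the first alternative is widened here (an extension, not part of the answer above) so that <= and >= are also kept together:

```r
# Hypothetical helper (not from the answer): tokenise comparison expressions.
# First alternative: any of ==, !=, <=, >= ; second: a lone < or > ;
# third: the tempered token from the answer, stopping before == or !=.
split_comparison <- function(x) {
  pattern <- "[!=<>]=|[<>]|(?:(?![!=]=)[^<>])+"
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

res <- split_comparison(c("var<3", "var==5", "var>2", "var>=2", "var!=0"))
res[[2]]
# [1] "var" "==" "5"
```

Because the two-character operators are listed before [<>], they win over the single-character alternative at the same position. A lone = (as in "a=b") is not treated as an operator by this sketch and stays inside the surrounding text.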

answered Sep 30 '22 20:09

Wiktor Stribiżew

Expanding from my idea in comments, just for the formatting:

tests <- c("var==5", "var<3", "var.name>5")
regmatches(tests, regexec("([a-zA-Z0-9_.]+)(\\W+)([a-zA-Z0-9_.]+)", tests))

\w is [a-zA-Z0-9_] and \W is its opposite, [^a-zA-Z0-9_]. After a comment, I expanded the class to include ., and wrote it out explicitly because R doesn't support \w inside a character class in base regex (that would need perl=TRUE).

So the regex searches for at least one character from \w or ., then at least one character not in \w (to match the operator), and then again at least one character from \w or ..

Each part is captured, which gives:

[[1]]
[1] "var==5" "var"    "=="     "5"     

[[2]]
[1] "var<3" "var"   "<"     "3"    

[[3]]
[1] "var.name>5" "var.name"   ">"          "5"
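If the pieces are needed in tabular form, one possible follow-up (a sketch only; the column names lhs, op and rhs are made up here) is to stack the matches into a matrix and drop the full-match column. This assumes every input matches the pattern, since a non-matching element yields an empty vector that would not line up:

```r
tests <- c("var==5", "var<3", "var.name>5")
m <- regmatches(tests, regexec("([a-zA-Z0-9_.]+)(\\W+)([a-zA-Z0-9_.]+)", tests))

# Each list element has 4 entries: full match, left side, operator, right side.
# Stack them row-wise and drop column 1 (the full match).
parts <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(parts) <- c("lhs", "op", "rhs")  # hypothetical column names
parts
```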

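If your entries could have spaces around the operator, you may add " *" (a space followed by *) between the capture groups; otherwise the operator capture will include them. Note that \W also matches spaces, so the operator group can still pick up a trailing space unless whitespace is excluded from it.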
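A whitespace-tolerant variant could look like the sketch below. It is an adaptation, not the pattern from this answer: it uses POSIX classes and excludes whitespace (and the dot) from the operator group, so surrounding spaces never end up in the captured operator.

```r
tests <- c("var == 5", "var<3", "var.name >5")

pattern <- paste0(
  "([[:alnum:]_.]+)",            # left operand: word characters and dots
  "[[:space:]]*",                # optional spaces, not captured
  "([^[:alnum:]_.[:space:]]+)",  # operator: no word chars, dots or whitespace
  "[[:space:]]*",                # optional spaces, not captured
  "([[:alnum:]_.]+)"             # right operand
)

res <- regmatches(tests, regexec(pattern, tests))
res[[1]]
# [1] "var == 5" "var"      "=="       "5"
```

The first element of each result is still the full match, including any spaces; the three captures that follow are free of them.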

answered Sep 30 '22 20:09

Tensibai
