Split character vector at math comparisons signs in R

Tags: regex, r

I would like to split expressions at mathematical comparison operators, e.g.

unlist(strsplit("var<3", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var==5", "(?=[=<>])", perl = TRUE))
unlist(strsplit("var>2", "(?=[=<>])", perl = TRUE))

The results are:

[1] "var" "<"   "3"  
[1] "var" "="   "="   "5"  
[1] "var" ">"   "2"  

For the 2nd example above, I would like to get [1] "var" "==" "5", so the two = should be returned as a single element. How do I need to change my regular expression to achieve this? (I already tried grouping and quantifiers for "==", but nothing worked - regular expressions are not my friends...)

Daniel asked Nov 23 '16 08:11



2 Answers

You may use a PCRE regex to match the substrings you need:

==|[<>]|(?:(?!==)[^<>])+

To also support !=, change it to

[!=]=|[<>]|(?:(?![=!]=)[^<>])+

See the regex demo.

Details:

  • == - 2 = signs
  • | - or
  • [<>] - a < or >
  • | - or
  • (?:(?!==)[^<>])+ - 1 or more chars other than < and > ([^<>]) that do not start a == char sequence (a tempered greedy token).

NOTE: This is easily expandable by adding more alternatives and adjusting the tempered greedy token.
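
As a rough sketch of that kind of extension (not part of the original answer): assuming you also want <= and >= returned as single tokens, widening the first alternative may be enough. The pattern and the sample inputs below are illustrative assumptions.

```r
# Assumed extension: [!=<>]= covers ==, !=, <= and >= as two-character operators.
# The lookahead only needs [!=]= because [^<>] already stops at < and >.
pattern <- "[!=<>]=|[<>]|(?:(?![!=]=)[^<>])+"

# Hypothetical sample inputs
exprs <- c("var<3", "var==5", "var>=2", "var!=7", "var<=10")

# gregexpr() is vectorised, so all expressions are tokenised in one call
tokens <- regmatches(exprs, gregexpr(pattern, exprs, perl = TRUE))
names(tokens) <- exprs
tokens
```

The two-character alternative has to come first, otherwise [<>] would match the < of <= on its own.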

R test:

> text <- "Text1==text2<text3><More here"
> res <- regmatches(text, gregexpr("==|[<>]|(?:(?!==)[^<>])+", text, perl=TRUE))
> res
[[1]]
[1] "Text1"     "=="        "text2"     "<"         "text3"     ">"        
[7] "<"         "More here"
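
Applied to the three expressions from the question, the same pattern could be wrapped in a small helper. This is only a sketch; the function name split_comparison is made up for illustration.

```r
# Hypothetical helper around the pattern from this answer
split_comparison <- function(x) {
  regmatches(x, gregexpr("==|[<>]|(?:(?!==)[^<>])+", x, perl = TRUE))
}

# The three inputs from the question, tokenised in one call
result <- split_comparison(c("var<3", "var==5", "var>2"))
result
```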

Wiktor Stribiżew answered Sep 30 '22 20:09


Expanding on my idea from the comments, posted as an answer just for the formatting:

tests=c("var==5","var<3","var.name>5")
regmatches(tests,regexec("([a-zA-Z0-9_.]+)(\\W+)([a-zA-Z0-9_.]+)",tests))

\w is [a-zA-Z0-9_] and \W is its opposite, [^a-zA-Z0-9_]. After a comment, I expanded the class to include . as well. I wrote the class out explicitly because R's base regex does not support \w inside a character class (that needs perl=TRUE).

So the regex searches for at least one character from \w or a dot, then at least one character not in \w (to match the operator), and then again at least one character from \w or a dot.

Each part is captured, which gives:

[[1]]
[1] "var==5" "var"    "=="     "5"     

[[2]]
[1] "var<3" "var"   "<"     "3"    

[[3]]
[1] "var.name>5" "var.name"   ">"          "5"       
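
For comparison, here is a sketch of the shorter spelling that perl=TRUE allows, where \w can sit inside the character class. It assumes an R version in which regexec() accepts the perl argument (R 3.3.0 or later, as far as I know); the variable names are only illustrative.

```r
tests <- c("var==5", "var<3", "var.name>5")

# PCRE version: [\w.] is a word character or a dot
parts_perl <- regmatches(tests, regexec("([\\w.]+)(\\W+)([\\w.]+)", tests, perl = TRUE))

# Base (non-perl) version from above, with the class spelled out
parts_base <- regmatches(tests, regexec("([a-zA-Z0-9_.]+)(\\W+)([a-zA-Z0-9_.]+)", tests))

# For these ASCII inputs both versions should find the same pieces
parts_perl
```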

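You may add a space followed by * between the capture groups if your entries could have spaces around the operator; if you don't, the operator capture will pick them up, since \W matches spaces too. Because \W+ is greedy, a trailing space can still end up in the operator capture, so excluding the space from the operator class, e.g. ([^a-zA-Z0-9_. ]+), is more robust.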
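
A minimal sketch of that space handling, assuming the operator never contains a space itself. The inputs and the names pattern, parts and ops are made up for illustration.

```r
# Hypothetical inputs with optional spaces around the operator
tests <- c("var == 5", "var<3", "var.name >5")

# " *" swallows the spaces; the operator class excludes word characters, dots and spaces
pattern <- "([a-zA-Z0-9_.]+) *([^a-zA-Z0-9_. ]+) *([a-zA-Z0-9_.]+)"
parts <- regmatches(tests, regexec(pattern, tests))

# Element 1 of each match is the full match, so the operator is element 3
ops <- vapply(parts, function(m) m[3], character(1))
ops
```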

Tensibai answered Sep 30 '22 20:09