How to capture minus sign in scientific notation with regex?

I was trying to answer a question (since deleted) that I think asked how to extract text representations of numbers in scientific notation. (I'm using R's regex implementation, which requires doubled backslashes to escape metacharacters and can run in either its default extended mode or, with perl=TRUE, a Perl-compatible (PCRE) mode; I don't really understand the difference between the two.) I've solved most of the task but still fail to capture the leading minus sign within a capture group. The only way I can get it to succeed is by anchoring on the leading open parenthesis:

> txt <- c("this is some random text (2.22222222e-200)", "other random (3.33333e4)", "yet a third(-1.33333e-40)", 'and a fourth w/o the "e" (2.22222222-200)')
> sub("^(.+\\()([-+]{0,1}[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200" 

> sub("^(.+\\()([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200" 
 #but that seems to be "cheating" ... my failures follow:

> sub("^(.+)([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "1.33333e-40"     "2.22222222-200" 
> sub("^(.+)(-?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "1.33333e-40"     "2.22222222-200" 
> sub("^(.+)(-*[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "1.33333e-40"     "2.22222222-200" 

I've searched SO to the extent of my patience with terms like "scientific notation regex minus".

IRTFM asked May 03 '15 19:05


3 Answers

You can try

 library(stringr)
 unlist(str_extract_all(txt, '-?[0-9.]+e?[-+]?[0-9]*'))
 #[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200" 

Or, using a lookbehind to capture whatever follows the leading open parenthesis:

 str_extract(txt, '(?<=\\()[^)]*')
 #[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200" 
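
For readers without stringr, the same lookbehind idea can be sketched in base R. This is only an illustration, assuming perl = TRUE (lookbehind needs the PCRE engine) and the txt vector from the question; the name inside_parens is made up for the example.

```r
txt <- c("this is some random text (2.22222222e-200)",
         "other random (3.33333e4)",
         "yet a third(-1.33333e-40)",
         'and a fourth w/o the "e" (2.22222222-200)')

# (?<=\\() asserts that the match is preceded by "(",
# and [^)]* then takes everything up to the closing ")".
inside_parens <- regmatches(txt, regexpr("(?<=\\()[^)]*", txt, perl = TRUE))
print(inside_parens)
```

Note that this relies on the number being the only thing inside the parentheses; it does not check that the captured text looks like a number.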
akrun answered Oct 21 '22 02:10


Reasoning that the "greedy" first capture group "(.+)" was gobbling up the minus sign that is only optional in the second capture group, I ended the first capture group with a negated character class and now have success. This still seems clunky, and I'm hoping there is something more elegant. While searching I have seen Python code that seems to imply that there are regex definitions of "&real_number".

> sub("^(.+[^-+])([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt,perl=TRUE)
[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200" 
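
The greedy behaviour described above can be made visible by printing the first capture group instead of the second. The following is a small sketch, assuming perl=TRUE and reusing the number pattern from the question; the variable names are only for illustration. It also shows one possible alternative to the negated character class: making the first group lazy with "+?", so that it stops as soon as the second group can match, sign included.

```r
txt3 <- "yet a third(-1.33333e-40)"
num  <- "[-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3}"

greedy <- paste0("^(.+)(",  num, ")(.+$)")   # original, greedy first group
lazy   <- paste0("^(.+?)(", num, ")(.+$)")   # lazy first group (assumed alternative)

g1 <- sub(greedy, "\\1", txt3, perl = TRUE)  # what the greedy group swallowed
g2 <- sub(greedy, "\\2", txt3, perl = TRUE)  # number without its sign
l2 <- sub(lazy,   "\\2", txt3, perl = TRUE)  # number with its sign

print(g1)  # "yet a third(-"
print(g2)  # "1.33333e-40"
print(l2)  # "-1.33333e-40"
```

Because the sign is optional, the greedy group is free to keep it; the lazy version gives the second group the first chance at the "-".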

After looking at the code of str_extract_all, which uses substr to extract matches, I now think I should have chosen the gregexpr-regmatches paradigm rather than the pick-the-middle-of-three-capture-groups strategy:

> hits <- gregexpr('[-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3}', txt)
> ?regmatches
> regmatches(txt, hits)
[[1]]
[1] "2.22222222e-200"

[[2]]
[1] "3.33333e4"

[[3]]
[1] "-1.33333e-40"

[[4]]
[1] "2.22222222-200"
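
One practical difference from the sub() approach is that gregexpr reports every hit in a string, not just one. A minimal sketch, assuming the same pattern as above with perl = TRUE; the helper name extract_sci and the test string are hypothetical, not from the original post.

```r
# Hypothetical helper wrapping the gregexpr/regmatches pair.
extract_sci <- function(x) {
  pat <- "[-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3}"
  regmatches(x, gregexpr(pat, x, perl = TRUE))
}

# A made-up string containing two numbers; both are returned.
multi <- extract_sci("a (1.5e3) and b (-2.5e-3)")[[1]]
print(multi)  # "1.5e3"   "-2.5e-3"
```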
IRTFM answered Oct 21 '22 01:10


This seems to work, and won't match an IP address:

sub("^.*?([-+]?\\d+(?:\\.\\d*)*(?:[Ee]?[-+]?\\d+)?).*?$", "\\1", txt)
[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200"

Oddly, that's not quite the regex I started with. When my first attempt didn't work, I went back and tested it in Perl:

my @txt = (
  "this is some random text (2.22222222e-200)",
  "other random (3.33333e4)",
  "yet a third(-1.33333e-40)" ,
  'and a fourth w/o the "e" (2.22222222-200)');

map { s/^.*?[^-+]([-+]?\d+(?:\.\d*)*(?:[Ee]?[-+]?\d+)?).*?$/$1/ } @txt;

print join("\n", @txt),"\n";

And that looked good:

2.22222222e-200
3.33333e4
-1.33333e-40
2.22222222-200

So the same regex should work in R, right?

sub("^.*?[^-+]([-+]?\\d+(?:\\.\\d*)*(?:[Ee]?[-+]?\\d+)?).*?$", "\\1", txt)
[1] "0" "4" "0" "0"

Apparently not. I even confirmed that the double-quoted string is correct by trying it in JavaScript with new RegExp("..."), and it worked fine there, too. I'm not sure what's different about R, but removing the negated sign character class did the trick.
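
One likely factor is that sub() does not use a Perl-style engine unless asked: by default R's regex functions use the TRE library (POSIX-style extended expressions), and perl = TRUE switches to PCRE. The sketch below reruns the Perl-tested pattern with perl = TRUE, assuming the txt vector from the question; it is offered as a probable explanation rather than a diagnosis of what TRE does with this pattern.

```r
txt <- c("this is some random text (2.22222222e-200)",
         "other random (3.33333e4)",
         "yet a third(-1.33333e-40)",
         'and a fourth w/o the "e" (2.22222222-200)')

# Same pattern as in the Perl test, negated sign class included.
pat <- "^.*?[^-+]([-+]?\\d+(?:\\.\\d*)*(?:[Ee]?[-+]?\\d+)?).*?$"

# perl = TRUE selects PCRE instead of the default TRE engine.
res <- sub(pat, "\\1", txt, perl = TRUE)
print(res)
```

With PCRE the lazy ".*?" stops at the first position where the rest of the pattern can match, which is the behaviour the Perl and JavaScript tests relied on.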

Mark Reed answered Oct 21 '22 02:10