I was trying to answer a question (that later got deleted) that I think was asking about extracting text representations of numbers in scientific notation. (I'm using R's implementation of regex, which requires double escapes for metacharacters and can be used in either the default extended mode or the Perl-compatible mode via perl=TRUE; I don't really understand the difference between them.) I've solved most of the task but still seem to be failing to capture the leading minus sign within a capture group. The only way I seem to get it to succeed is by including the leading open parenthesis in the pattern:
> txt <- c("this is some random text (2.22222222e-200)", "other random (3.33333e4)", "yet a third(-1.33333e-40)", 'and a fourth w/o the "e" (2.22222222-200)')
> sub("^(.+\\()([-+]{0,1}[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4" "-1.33333e-40" "2.22222222-200"
> sub("^(.+\\()([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4" "-1.33333e-40" "2.22222222-200"
#but that seems to be "cheating" ... my failures follow:
> sub("^(.+)([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4" "1.33333e-40" "2.22222222-200"
> sub("^(.+)(-?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4" "1.33333e-40" "2.22222222-200"
> sub("^(.+)(-*[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4" "1.33333e-40" "2.22222222-200"
I've searched SO to the extent of my patience with terms like 'scientific notation regex minus'.
You can try:
library(stringr)
unlist(str_extract_all(txt, '-?[0-9.]+e?[-+]?[0-9]*'))
#[1] "2.22222222e-200" "3.33333e4" "-1.33333e-40" "2.22222222-200"
Or, using a lookbehind to capture whatever follows the leading parenthesis:
str_extract(txt, '(?<=\\()[^)]*')
#[1] "2.22222222e-200" "3.33333e4" "-1.33333e-40" "2.22222222-200"
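For readers who would rather not load stringr, the same lookbehind idea can be sketched in base R. This is only an illustration, not part of the original answer; the variable names m and res_lookbehind are made up here, and perl = TRUE is assumed because lookbehind is a Perl-compatible feature.

```r
txt <- c("this is some random text (2.22222222e-200)",
         "other random (3.33333e4)",
         "yet a third(-1.33333e-40)",
         'and a fourth w/o the "e" (2.22222222-200)')

# Lookbehind assertions need the PCRE engine, hence perl = TRUE.
# The pattern matches everything after "(" up to, but excluding, ")".
m <- regexpr("(?<=\\()[^)]*", txt, perl = TRUE)

# regmatches() returns the matched substrings as a character vector.
res_lookbehind <- regmatches(txt, m)
print(res_lookbehind)
```

Like the str_extract call, this relies on the number being the only thing inside the parentheses.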
Reasoning that it was the "greedy" capacity of the "(.+)" first capture group to gobble up the minus sign that was optional in the second capture group, I terminated the first capture group with a negated character class and now have success. This still seems clunky, and I'm hoping there is something more elegant. In searching, I have seen Python code that seems to imply that there are regex definitions of "&real_number".
> sub("^(.+[^-+])([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt,perl=TRUE)
[1] "2.22222222e-200" "3.33333e4" "-1.33333e-40" "2.22222222-200"
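To make the greedy behaviour visible, here is a small sketch that prints what the first capture group grabs with and without the negated character class. The helper names (x, num, greedy, guarded) are introduced only for this illustration.

```r
x <- "yet a third(-1.33333e-40)"

# The number sub-pattern used throughout the question.
num <- "([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})"

greedy  <- paste0("^(.+)", num, "(.+$)")       # first group may end in a sign
guarded <- paste0("^(.+[^-+])", num, "(.+$)")  # first group must not end in a sign

# Return group 1 instead of group 2 to see what the prefix swallowed.
g1_greedy  <- sub(greedy,  "\\1", x, perl = TRUE)
g1_guarded <- sub(guarded, "\\1", x, perl = TRUE)

print(g1_greedy)   # prefix includes the "-"
print(g1_guarded)  # prefix stops at "("
```

Because the sign in the second group is optional, the engine is satisfied as soon as the greedy prefix has taken the minus; forcing the prefix to end in a non-sign character leaves the minus for the number.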
After looking at the code in str_extract_all, which uses substr to extract matches, I now think I should have chosen the gregexpr-regmatches paradigm for my efforts rather than the pick-the-middle-of-three-capture-groups strategy:
> hits <- gregexpr('[-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3}', txt)
> ?regmatches
> regmatches(txt, hits)
[[1]]
[1] "2.22222222e-200"
[[2]]
[1] "3.33333e4"
[[3]]
[1] "-1.33333e-40"
[[4]]
[1] "2.22222222-200"
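Since regmatches returns a list here, a natural follow-up is to flatten it and try converting the pieces to numbers. The sketch below is an assumption about what one might do next, not something from the original post; found and vals are hypothetical names.

```r
txt <- c("this is some random text (2.22222222e-200)",
         "other random (3.33333e4)",
         "yet a third(-1.33333e-40)",
         'and a fourth w/o the "e" (2.22222222-200)')

hits <- gregexpr("[-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3}", txt)

# One character vector containing every match from every input string.
found <- unlist(regmatches(txt, hits))

# Strings lacking the "e" are not valid numeric literals and become NA;
# the coercion warning is suppressed deliberately for this demo.
vals <- suppressWarnings(as.numeric(found))
print(vals)
```

The fourth string matches the pattern but is not parseable as a number, which shows that the regex is more permissive than R's own numeric parser.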
This seems to work, and won't match an IP address:
sub("^.*?([-+]?\\d+(?:\\.\\d*)*(?:[Ee]?[-+]?\\d+)?).*?$", "\\1", txt)
[1] "2.22222222e-200" "3.33333e4" "-1.33333e-40" "2.22222222-200"
Oddly, that's not quite the regex I started with. When my first try didn't work, I thought I would go back and test in Perl:
my @txt = (
"this is some random text (2.22222222e-200)",
"other random (3.33333e4)",
"yet a third(-1.33333e-40)" ,
'and a fourth w/o the "e" (2.22222222-200)');
map { s/^.*?[^-+]([-+]?\d+(?:\.\d*)*(?:[Ee]?[-+]?\d+)?).*?$/$1/ } @txt;
print join("\n", @txt),"\n";
And that looked good:
2.22222222e-200
3.33333e4
-1.33333e-40
2.22222222-200
So the same regex should work in R, right?
sub("^.*?[^-+]([-+]?\\d+(?:\\.\\d*)*(?:[Ee]?[-+]?\\d+)?).*?$", "\\1", txt)
[1] "0" "4" "0" "0"
Apparently not. I even confirmed that the double-quoted string is correct by trying it in JavaScript with new RegExp("..."), and it worked fine there, too. Not sure what's different about R, but removing the negated sign character class did the trick.
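One plausible explanation, offered as an assumption rather than a confirmed diagnosis: without perl = TRUE, R's sub uses its default extended-regex engine (TRE), which does not necessarily treat lazy quantifiers the way Perl and JavaScript do. A minimal sketch that runs the Perl pattern through the PCRE engine instead:

```r
txt <- c("this is some random text (2.22222222e-200)",
         "other random (3.33333e4)",
         "yet a third(-1.33333e-40)",
         'and a fourth w/o the "e" (2.22222222-200)')

# Same pattern as the Perl one-liner, with backslashes doubled for R strings.
pat <- "^.*?[^-+]([-+]?\\d+(?:\\.\\d*)*(?:[Ee]?[-+]?\\d+)?).*?$"

# perl = TRUE asks R to use PCRE rather than the default engine.
res_pcre <- sub(pat, "\\1", txt, perl = TRUE)
print(res_pcre)
```

If that reproduces the Perl output, the difference comes from the choice of engine rather than from the escaping of the string.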