I'm searching text_, which is:
本周(3月25日-3月31日),国内油厂开机率继续下降,全国各地油厂大豆压榨总量1456000吨(出粕1157520吨,出油262080吨),较上周的...[continued]
crush <- str_extract(string = text_, pattern = perl("(?<=量).*(?=吨\\(出粕)"))
meal <- str_extract(string = text_, pattern = perl("(?<=粕).*(?=吨,出)"))
oil <- str_extract(string = text_, pattern = perl("(?<=出油).*(?=吨\\))"))
prints
[1] "1456000" ## correct
[1] "1157520" ## correct
[1] NA ## looking for 262080 here
Why do the first two match but not the last one? I'm using the stringr library.
Note that the current version of the stringr package is based on the ICU regex library, and using perl() is deprecated.
Note also that lookbehind patterns must have a bounded width (no * or + inside them), and it seems that there is a problem with how ICU parses the first letter in your lookbehind pattern (it cannot calculate its width for some unknown reason).
Since you are using stringr, you can rely on a capturing group instead, using str_match to extract just a part of the pattern:
> match <- str_match(s, "出油(\\d+)吨")
> match[,2]
[1] "262080"
This way, you avoid any potential lookbehind issues in the future. These regexps also run faster, since there is no unanchored lookbehind that has to be checked at every location in the searched string.
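As a rough sketch of how the same capturing idea could pull all three numbers at once: the string below is a shortened stand-in for the question's text_, the name m is only illustrative, and the character classes accept both ASCII and full-width punctuation on the assumption that the real text may contain either.

```r
library(stringr)

# Shortened sample of the text from the question (assumption, not the full string)
text_ <- "全国各地油厂大豆压榨总量1456000吨(出粕1157520吨,出油262080吨)"

# One pattern, three capture groups: crush total, meal, oil.
# [((] and [,,] tolerate ASCII or full-width punctuation.
m <- str_match(text_, "量(\\d+)吨[((]出粕(\\d+)吨[,,]出油(\\d+)吨")

# Column 1 is the whole match; columns 2-4 are the capture groups
crush <- m[, 2]
meal  <- m[, 3]
oil   <- m[, 4]
```

If the pattern does not match, every column of m is NA, so a missing value is easy to detect before converting with as.numeric().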
Also, you may just use your PCRE regex with base R:
> regmatches(s, regexpr("(?<=出油)\\d+(?=吨)", s, perl=TRUE))
[1] "262080"
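If you would rather avoid lookarounds in base R as well, a capture group works there too. This is only a minimal sketch: s is a shortened stand-in for the question's text, and hit, oil and oil_num are illustrative names.

```r
# Shortened sample of the text from the question (assumption, not the full string)
s <- "全国各地油厂大豆压榨总量1456000吨(出粕1157520吨,出油262080吨)"

# regexec() records the whole match plus each capture group;
# regmatches() turns those positions into strings
hit <- regmatches(s, regexec("出油([0-9]+)吨", s))[[1]]

oil <- hit[2]               # first capture group: the digits only
oil_num <- as.numeric(oil)  # convert if a number is needed
```

Here hit[1] holds the full match including 出油 and 吨, which is why the value of interest is taken from the second element.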
For some reason I still don't understand, I wasn't able to use @WiktorStribiżew's commented solution, but this ended up working:
oil <- str_extract(string = text_, pattern = perl("(?<=吨).*(?=吨)"))
# [1] "(出粕1157520吨,出油262080吨),较
oil <- str_extract(string = oil, pattern = perl("(?<=油)\\d+(?=吨)"))
# [1] "262080"