Regex with Chinese characters

Tags: regex, r, stringr

I'm searching the string text_, which contains: 本周(3月25日-3月31日),国内油厂开机率继续下降,全国各地油厂大豆压榨总量1456000吨(出粕1157520吨,出油262080吨),较上周的...[continued]

  crush <- str_extract(string = text_, pattern = perl("(?<=量).*(?=吨(出粕)"))
  meal <- str_extract(string = text_, pattern = perl("(?<=粕).*(?=吨,出)"))
  oil <-  str_extract(string = text_, pattern = perl("(?<=出油).*(?=吨))"))

prints

[1] "1456000"   ## correct
[1] "1157520"   ## correct
[1] NA          ## looking for 262080 here

Why do the first two match but not the last one? I'm using the stringr library.

Rafael asked Oct 17 '22 14:10


2 Answers

Note that the current version of the stringr package is based on the ICU regex library, and using perl() is deprecated.

Lookbehind patterns must have a known, bounded width, and it seems ICU has a problem parsing the first character of your lookbehind pattern (for some unknown reason it cannot calculate its width).

Since you are using stringr, you can rely on a capturing group instead and use str_match to extract just that part of the match (s below is the input string, text_ in the question):

> match <- str_match(s, "出油(\\d+)吨")
> match[,2]
[1] "262080"

This way, you avoid this kind of issue in the future. These regexps also run faster, since the pattern has no unanchored lookbehind that must be tried at every position in the searched string.
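The same capturing approach should cover all three figures from the question. Here is a minimal sketch, assuming the input looks like the excerpt quoted in the question (the sample string below is only the relevant fragment, and the anchors 总量, 出粕 and 出油 are my choice, not something the question prescribes):

```r
library(stringr)

# Fragment of the text quoted in the question (the full text is truncated there).
text_ <- "全国各地油厂大豆压榨总量1456000吨(出粕1157520吨,出油262080吨)"

# str_match returns a matrix: column 1 is the whole match,
# column 2 is the first capturing group (the digits we want).
crush <- str_match(text_, "总量(\\d+)吨")[, 2]
meal  <- str_match(text_, "出粕(\\d+)吨")[, 2]
oil   <- str_match(text_, "出油(\\d+)吨")[, 2]

# The extracted values are character strings; convert if numbers are needed.
figures <- c(crush = as.numeric(crush),
             meal  = as.numeric(meal),
             oil   = as.numeric(oil))
print(figures)
```

Because none of these patterns contains a parenthesis or comma from the text, they do not depend on whether the original uses ASCII or full-width punctuation.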

Alternatively, you can use your PCRE regex with base R:

> regmatches(s, regexpr("(?<=出油)\\d+(?=吨)", s, perl=TRUE))
[1] "262080"
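If you would rather avoid lookbehind in base R as well, a capturing group works there too. This is a rough sketch under the same assumption about the input (s holds only the relevant fragment of the question's text); regexec and gregexpr are base R functions, the variable names are just for illustration:

```r
# Fragment of the text quoted in the question.
s <- "全国各地油厂大豆压榨总量1456000吨(出粕1157520吨,出油262080吨)"

# regexec keeps capture groups: element 1 is the whole match,
# element 2 is the first group.
m <- regmatches(s, regexec("出油([0-9]+)吨", s))
oil <- m[[1]][2]
print(oil)

# A lookahead alone (no lookbehind) collects every number followed by 吨.
all_tons <- regmatches(s, gregexpr("[0-9]+(?=吨)", s, perl = TRUE))[[1]]
print(all_tons)
```

The second call returns the figures in the order they appear in the text, so it relies on that order staying crush, meal, oil.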
Wiktor Stribiżew answered Nov 03 '22 19:11


For a reason I still don't know, I wasn't able to use the solution from @WiktorStribiżew's comment, but this ended up working:

oil <-  str_extract(string = text_, pattern = perl("(?<=吨).*(?=吨)"))
# [1] "(出粕1157520吨,出油262080吨),较
oil <- str_extract(string = oil, pattern = perl("(?<=油)\\d+(?=吨)"))
# [1] "262080"
Rafael answered Nov 03 '22 19:11
