regexpr syntax in R

Tags: regex, r

I am trying the following pattern, which should capture everything between productUrl:// and the next ?:

(?<=\"productUrl\"\:\"\/\/)(.*?)(?=\?)

The above works on https://regexr.com/

I then tried to escape the backslashes so that the string fits into the grep function, but with no luck. What is the proper way to do it?

See this example: link to example

I actually need to extract the substrings that match my pattern, so grep may have to be used in conjunction with another function.

Chapo asked Feb 03 '23 20:02


1 Answer

Note that you do not need to escape / in R regex patterns: they are defined with string literals, and / is not a special regex metacharacter. If you want to write a " inside a "..." string literal, you should escape it with a single \, as you are already doing.
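As a minimal sketch of the escaping rules above, the snippet below writes the question's pattern twice, once in a double-quoted and once in a single-quoted literal, and checks that R sees the same string. The sample input s is hypothetical and not taken from the question. The \: and \/ escapes are dropped because neither character is a regex metacharacter, and R rejects "\:" as an unrecognized escape in a string literal.

```r
# Same regex, two string-literal styles; only the quoting differs.
p_double <- "(?<=\"productUrl\":\"//)(.*?)(?=\\?)"  # " must be escaped inside "..."
p_single <- '(?<="productUrl":"//)(.*?)(?=\\?)'     # no escaping of " inside '...'

identical(p_double, p_single)
# [1] TRUE

# cat() shows the characters the regex engine actually receives:
# the doubled backslash in the literal is a single backslash in the pattern.
cat(p_single, "\n")

# Hypothetical sample input (assumption, not from the question).
s <- '"productUrl":"//example.com/item.html?search=1"'
m <- regmatches(s, regexpr(p_single, s, perl = TRUE))
m
# [1] "example.com/item.html"
```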

You may avoid overescaping here if you use single quotes to define the string literal and if you turn .*?(?=\?) into a negated character class:

grep('(?<="productUrl":"//)([^?]*)', x, perl=TRUE)

The [^?]* negated character class matches any 0 or more chars other than ?.
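One caveat worth checking, sketched here with hypothetical inputs: the negated character class is not strictly equivalent to the original .*?(?=\?). The lazy version requires a ? to follow, while [^?]* also matches when the string contains no ? at all.

```r
# Hypothetical inputs (assumption): one URL with a query string, one without.
s <- c('productUrl://a.example/x.html?search=1',
       'productUrl://b.example/y.html')

lazy_pattern    <- '(?<=productUrl://).*?(?=\\?)'  # needs a following ?
negated_pattern <- '(?<=productUrl://)[^?]*'       # stops at ? or at end of string

lazy_hits    <- grepl(lazy_pattern, s, perl = TRUE)
negated_hits <- grepl(negated_pattern, s, perl = TRUE)

lazy_hits
# [1]  TRUE FALSE
negated_hits
# [1] TRUE TRUE
```

If a trailing ? must be present, the lookahead can be kept after the class, as in [^?]*(?=\\?).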

If the string you are checking against has no double quotes, remove them from the lookbehind:

grep('(?<=productUrl://)([^?]*)', x, perl=TRUE)

Instead of the lookbehind, you may also use \K, which drops the text matched so far from the returned match:

grep('productUrl://\\K[^?]*', x, perl=TRUE)
                   ^^^ 

Actually, you do not even need the capturing group in your pattern.
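To illustrate that the lookbehind and \K variants return the same text, here is a small sketch on a hypothetical input string. It uses regmatches with regexpr so that the matched substring is visible, which grep alone would not show.

```r
# Hypothetical input (assumption, not from the question).
s <- 'productUrl://shop.example/item.html?search=1'

# Lookbehind: asserts the prefix without consuming it.
with_lookbehind <- regmatches(s, regexpr('(?<=productUrl://)[^?]*', s, perl = TRUE))

# \K: consumes the prefix, then discards it from the reported match.
with_K <- regmatches(s, regexpr('productUrl://\\K[^?]*', s, perl = TRUE))

with_lookbehind
# [1] "shop.example/item.html"
identical(with_lookbehind, with_K)
# [1] TRUE
```

Both forms need perl=TRUE, since lookbehinds and \K are PCRE features.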

Solving the actual task

You cannot extract substrings with grep in R; grep only finds/identifies the elements of a character vector that contain a match. To extract substrings, you need to use base R regmatches, stringr str_extract/str_extract_all, or similar match functions.
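The difference can be seen side by side in the following sketch. The vector v is a hypothetical two-element input, chosen so that only the first element contains a match.

```r
# Hypothetical character vector (assumption): only element 1 holds a productUrl.
v <- c('"productUrl":"//shop.example/a.html?search=1"',
       '"image":"https://img.example/a.jpg"')
p <- '"productUrl":"\\K[^?"]*'

idx     <- grep(p, v, perl = TRUE)                # indices of matching elements
whole   <- grep(p, v, perl = TRUE, value = TRUE)  # the matching elements, in full
flags   <- grepl(p, v, perl = TRUE)               # logical vector, one per element
extract <- regmatches(v, regexpr(p, v, perl = TRUE))  # the matched substrings only

idx
# [1] 1
flags
# [1]  TRUE FALSE
extract
# [1] "//shop.example/a.html"
```

Note that regmatches combined with regexpr silently skips elements without a match, so the result can be shorter than the input vector.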

Example with base R:

> x <- '":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"","pos":0},"listItems":[{"name":"BRAND\'S® Lutein Essence 6 Bottles x 60ml","nid":"66765568","icons":[{"domClass":"lazMall","text":"LazMall","alias":"LazMallAlias","type":"img","group":"1","showType":"0","order":0}],\n"productUrl":"//www.lazada.sg/products/brands-lutein-essence-6-bottles-x-60ml-i138897006-s167303363.html?search=1","image":"https://sg-test-11.slatic.net/p/5337f879236ece2f14158c055adcdef7.jpg",\n"productUrl":"//www.lazada.sg/products/brands-lutein-essence-6-bottles-x-60ml-i138897006-s167303363.html?search=1","sku":"BR924HBAB3R0N4SGAMZ","skuId":"167303363"}],"restrictedAge":0,"categories":[1438,1565,4776,7305'
> regmatches(x, gregexpr('"productUrl":"\\K[^?"]*', x, perl=TRUE))
[[1]]
[1] "//www.lazada.sg/products/brands-lutein-essence-6-bottles-x-60ml-i138897006-s167303363.html"
[2] "//www.lazada.sg/products/brands-lutein-essence-6-bottles-x-60ml-i138897006-s167303363.html"

With stringr:

> library(stringr)
> str_extract_all(x, '(?<="productUrl":")[^?"]*')
[[1]]
[1] "//www.lazada.sg/products/brands-lutein-essence-6-bottles-x-60ml-i138897006-s167303363.html"
[2] "//www.lazada.sg/products/brands-lutein-essence-6-bottles-x-60ml-i138897006-s167303363.html"
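Both approaches return a list with one element per input string, which is why the output starts with [[1]]. If a plain character vector without repeats is more convenient, one possible follow-up step is sketched below in base R. The shortened input x is a hypothetical stand-in for the long string above.

```r
# Shortened, hypothetical stand-in for the input above (assumption).
x <- paste0('"nid":"1",\n"productUrl":"//shop.example/a.html?search=1",',
            '"image":"i.jpg",\n"productUrl":"//shop.example/a.html?search=1","sku":"S1"')

res <- regmatches(x, gregexpr('"productUrl":"\\K[^?"]*', x, perl = TRUE))

length(res[[1]])
# [1] 2

# Flatten the list and drop the duplicate.
urls <- unique(unlist(res))
urls
# [1] "//shop.example/a.html"
```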
Wiktor Stribiżew answered Feb 06 '23 09:02