
R regular expression issue

Tags: regex, r

I have a data frame column containing page paths:

pagePath
/text/other_text/123-some_other_txet-4571/text.html
/text/other_text/another_txet/15-some_other_txet.html
/text/other_text/25189-some_other_txet/45112-text.html
/text/other_text/text/text/5418874-some_other_txet.html
/text/other_text/text/text/some_other_txet-4157/text.html

What I want is to extract, from each row, the first number after a /, for example 123 from the first row.

To solve this problem, I tried the following:

 num = gsub("\\D", " ", mydata$pagePath)  # replace every non-digit character with a space

 num1 = gsub("\\s+", " ", num)  # leave only one space between numbers

 num2 = gsub("^\\s", "", num1)  # delete the leading space

 my_number = gsub(" .*$", "", num2)  # keep only the first number in the string

I thought this was what I wanted, but I ran into trouble, especially with rows like the last one in the example: /text/other_text/text/text/some_other_txet-4157/text.html

So, what I really want is to extract the first number after a /.

Any help would be very welcome.

asked Mar 13 '23 23:03 by sarah


1 Answer

You can use the following regex with gsub:

"^(?:.*?/(\\d+))?.*$"

And replace with "\\1". See the regex demo.

Code:

> s <- c("/text/other_text/123-some_other_txet-4571/text.html",
+        "/text/other_text/another_txet/15-some_other_txet.html",
+        "/text/other_text/25189-some_other_txet/45112-text.html",
+        "/text/other_text/text/text/5418874-some_other_txet.html",
+        "/text/other_text/text/text/some_other_txet-4157/text.html")
> gsub("^(?:.*?/(\\d+))?.*$", "\\1", s, perl=T)
[1] "123"     "15"      "25189"   "5418874" ""

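The regex optionally matches (via the (?:.*?/(\\d+))? subpattern) the part of the string from the beginning up to the first / that is followed by 1 or more digits: .*?/ matches up to that slash, and (\\d+) captures the digits into Group 1. It then matches the rest of the string up to its end (with .*$). Replacing the whole match with "\\1" leaves only the captured digits, or an empty string when no / is directly followed by a digit, as in the last row.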

NOTE that perl=T is required.
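To apply the same pattern to the data frame column from the question, here is a minimal sketch. The helper name first_number_after_slash is hypothetical, mydata is rebuilt here from the sample rows, and turning the empty result into NA (then converting to integer) is an assumption about how a row without a match should be represented.

```r
# Sample data rebuilt from the rows in the question
# (mydata and pagePath are the names used there).
mydata <- data.frame(
  pagePath = c(
    "/text/other_text/123-some_other_txet-4571/text.html",
    "/text/other_text/another_txet/15-some_other_txet.html",
    "/text/other_text/25189-some_other_txet/45112-text.html",
    "/text/other_text/text/text/5418874-some_other_txet.html",
    "/text/other_text/text/text/some_other_txet-4157/text.html"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the digits that directly follow the first "/"
# that has digits right after it, as an integer vector.
first_number_after_slash <- function(x) {
  # The pattern consumes the whole string, so sub() and gsub() behave the same here.
  out <- sub("^(?:.*?/(\\d+))?.*$", "\\1", x, perl = TRUE)
  # Assumption: an empty result (no "/" followed by digits) becomes NA.
  out[!nzchar(out)] <- NA_character_
  as.integer(out)
}

mydata$first_number <- first_number_after_slash(mydata$pagePath)
mydata$first_number
```

Drop the as.integer() call if the numbers should stay as character strings, for example to preserve leading zeros.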

With str_extract from the stringr package, your code and pattern can be shortened to:

> library(stringr)
> str_extract(s, "(?<=/)\\d+")
[1] "123"     "15"      "25189"   "5418874" NA

str_extract extracts the first run of 1 or more digits that is preceded by a /. The / itself is not returned as part of the match, since (?<=/) is a lookbehind, a zero-width assertion that does not add the text it matches to the result. Where there is no match, str_extract returns NA.
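If adding a stringr dependency is not an option, the same lookbehind pattern can be used in base R with regexpr and regmatches. This is only a sketch: the variable names m and first_num are made up for illustration, and filling non-matching rows with NA is a choice made here to mirror the str_extract output. Lookbehind needs perl = TRUE in base R.

```r
s <- c("/text/other_text/123-some_other_txet-4571/text.html",
       "/text/other_text/another_txet/15-some_other_txet.html",
       "/text/other_text/25189-some_other_txet/45112-text.html",
       "/text/other_text/text/text/5418874-some_other_txet.html",
       "/text/other_text/text/text/some_other_txet-4157/text.html")

# regexpr() gives the position of the first match in each string, or -1 if none.
m <- regexpr("(?<=/)\\d+", s, perl = TRUE)

# regmatches() returns only the strings that matched, so start from an
# all-NA vector and fill in the positions where a match was found.
first_num <- rep(NA_character_, length(s))
first_num[m != -1] <- regmatches(s, m)
first_num
```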

answered Mar 15 '23 23:03 by Wiktor Stribiżew