R regular expression issue

Question

I have a data frame column containing page paths:

pagePath
/text/other_text/123-some_other_txet-4571/text.html
/text/other_text/another_txet/15-some_other_txet.html
/text/other_text/25189-some_other_txet/45112-text.html
/text/other_text/text/text/5418874-some_other_txet.html
/text/other_text/text/text/some_other_txet-4157/text.html

What I want to do is extract the first number that comes directly after a /, for example 123 from the first row.

To solve this problem, I tried the following:

 num = gsub("\\D", " ", mydata$pagePath)  # replace every non-digit character with a space

 num1 = gsub("\\s+", " ", num)  # leave only one space between numbers

 num2 = gsub("^\\s", "", num1)  # delete the leading space in the string

 my_number = gsub(" .*$", "", num2)  # keep only the first number in the string

I thought this was what I wanted, but I ran into trouble, especially with rows like the last one in the example: /text/other_text/text/text/some_other_txet-4157/text.html
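To make the failure concrete, here is a minimal sketch that runs the four steps above on that last path alone. The variable names `p` and `first_number` are only for illustration and are not part of the original code.

```r
# Single path taken from the last row of the example (illustrative variable name)
p <- "/text/other_text/text/text/some_other_txet-4157/text.html"

num  <- gsub("\\D", " ", p)             # every non-digit becomes a space
num1 <- gsub("\\s+", " ", num)          # collapse runs of whitespace
num2 <- gsub("^\\s", "", num1)          # drop the leading space
first_number <- gsub(" .*$", "", num2)  # keep the text before the first space

print(first_number)
```

Once every non-digit has been turned into a space, the information about which character preceded a number is gone, so the digits after the hyphen are returned even though they do not follow a /.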

So, what I really want is to extract the first number after a /.

Any help would be very welcome.

Wiktor Stribiżew · Accepted Answer

You can use the following regex with gsub:

"^(?:.*?/(\\d+))?.*$"

And replace with "\\1". See the regex demo.

Code:

> s <- c("/text/other_text/123-some_other_txet-4571/text.html", "/text/other_text/another_txet/15-some_other_txet.html", "/text/other_text/25189-some_other_txet/45112-text.html", "/text/other_text/text/text/5418874-some_other_txet.html", "/text/other_text/text/text/some_other_txet-4157/text.html")
> gsub("^(?:.*?/(\\d+))?.*$", "\\1", s, perl=T)
[1] "123"     "15"      "25189"   "5418874" ""

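The regex optionally matches (with the (?:.*?/(\d+))? subpattern) the part of the string from the beginning up to the first / that is followed by 1 or more digits (.*?/ lazily matches up to that /, and (\d+) captures the digits into Group 1), and then matches the rest of the string up to its end (with .*$). Because the group is optional, a string with no digits directly after a / is still matched in full and is replaced with an empty string.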
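As a rough sketch of how this pattern might be applied to the data frame from the question: `mydata` and `pagePath` come from the question, while the `first_num` column name and the conversion of empty strings to NA are assumptions added here for illustration. It uses sub instead of gsub, which behaves the same for this pattern because the ^ anchor allows at most one match per string.

```r
# Two sample rows from the question, rebuilt as a small data frame
mydata <- data.frame(
  pagePath = c(
    "/text/other_text/123-some_other_txet-4571/text.html",
    "/text/other_text/text/text/some_other_txet-4157/text.html"
  ),
  stringsAsFactors = FALSE
)

# Replace the whole string with capture group 1 (empty when there is no match)
first_num <- sub("^(?:.*?/(\\d+))?.*$", "\\1", mydata$pagePath, perl = TRUE)

# Assumption: treat "no number after a slash" as missing rather than ""
first_num[first_num == ""] <- NA
mydata$first_num <- as.integer(first_num)

print(mydata$first_num)
```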

NOTE that perl=T is required.

With stringr's str_extract, your code and pattern can be shortened to:

> library(stringr)
> str_extract(s, "(?<=/)\\d+")
[1] "123"     "15"      "25189"   "5418874" NA

str_extract extracts the first run of 1 or more digits that is preceded by a / (the / itself is not returned as part of the match, since it sits in a lookbehind, a zero-width assertion that does not add the matched text to the result). Strings with no such match yield NA.
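If stringr is not available, the same lookbehind can probably be used from base R. The following is a sketch under that assumption, not part of the original answer; the `result` vector and the NA pre-fill are illustrative choices that mimic what str_extract returns for non-matching strings.

```r
# Two sample paths from the question: one with a match, one without
s <- c(
  "/text/other_text/another_txet/15-some_other_txet.html",
  "/text/other_text/text/text/some_other_txet-4157/text.html"
)

# regexpr() gives the position of the first match per element (-1 if none)
m <- regexpr("(?<=/)\\d+", s, perl = TRUE)

# regmatches() returns only the matched elements, so start from NA
# everywhere and fill in the positions that actually matched
result <- rep(NA_character_, length(s))
result[m != -1] <- regmatches(s, m)

print(result)
```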
