I have a dataframe column including pages paths :
pagePath
/text/other_text/123-some_other_txet-4571/text.html
/text/other_text/another_txet/15-some_other_txet.html
/text/other_text/25189-some_other_txet/45112-text.html
/text/other_text/text/text/5418874-some_other_txet.html
/text/other_text/text/text/some_other_txet-4157/text.html
What I want to do is to extract the first number after a /, for example 123 from each row.
To solve this problem, I tried the following :
num = gsub("\\D"," ", mydata$pagePath) /*to delete all characters other than digits */
num1 = gsub("\\s+"," ",num) /*to let only one space between numbers*/
num2 = gsub("^\\s","",num1) /*to delete the first space in my string*/
my_number = gsub( " .*$", "", num2 ) /*to select the first number on my string*/
I thought that what's that I wanted, but I had some troubles, especially with rows like the last row in the example : /text/other_text/text/text/some_other_txet-4157/text.html
So, what I really want is to extract the first number after a /.
Any help would be very welcome.
You can use the following regex with gsub:
"^(?:.*?/(\\d+))?.*$"
And replace with "\\1". See the regex demo.
Code:
> s <- c("/text/other_text/123-some_other_txet-4571/text.html", "/text/other_text/another_txet/15-some_other_txet.html", "/text/other_text/25189-some_other_txet/45112-text.html", "/text/other_text/text/text/5418874-some_other_txet.html", "/text/other_text/text/text/some_other_txet-4157/text.html")
> gsub("^(?:.*?/(\\d+))?.*$", "\\1", s, perl=T)
[1] "123" "15" "25189" "5418874" ""
The regex will match optionally (with a (?:.*?/(\\d+))? subpattern) a part of string from the beginning till the first / (with .*?/) followed with 1 or more digits (capturing the digits into Group 1, with (\\d+)) and then the rest of the string up to its end (with .*$).
NOTE that perl=T is required.
with stringr str_extract, your code and pattern can be shortened to:
> str_extract(s, "(?<=/)\\d+")
[1] "123" "15" "25189" "5418874" NA
>
The str_extract will extract the first 1 or more digits if they are preceded with a / (the / itself is not returned as part of the match since it is a lookbehind subpattern, a zero width assertion, that does not put the matched text into the result).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With