
How to remove URLs without http in a text document using R

I am trying to remove URLs that may or may not start with http/https from a large text file, which I have loaded into urldoc in R. A URL may look like tinyurl.com/ydyzzlkk, aclj.us/2y6dQKw, or pic.twitter.com/ZH08wej40K. Basically, I want to remove everything from the space before a '/' up to the next space after it. I have tried many patterns and searched in many places, but couldn't complete the task. It would help me a lot if you could give some input.

This is the last statement I tried before getting stuck:

urldoc = gsub("?[a-z]+\..\/.[\s]$","", urldoc)

Input would be: A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk

Output I am expecting is: A disgrace to his profession. In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. nothing like the admin. proposal:

Thanks.

srk3124 asked Mar 25 '26 08:03


2 Answers

According to your specs, you may use the following regex:

\s*[^ /]+/[^ /]+

See the regex demo.

Details

  • \s* - 0 or more whitespace chars
  • [^ /]+ (or [^[:space:]/]+) - 1 or more chars other than a space (or any whitespace) and /
  • / - a slash
  • [^ /]+ (or [^[:space:]/]+) - 1 or more chars other than a space (or any whitespace) and /.

R demo:

urldoc = gsub("\\s*[^ /]+/[^ /]+","", urldoc)

If you want to account for any whitespace, replace the literal space with [:space:]:

urldoc = gsub("\\s*[^[:space:]/]+/[^[:space:]/]+","", urldoc)
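
As a quick check, the sketch below runs the [:space:] version of the pattern on the sample input from the question. The second pattern, scheme_pattern, and the with_scheme test string are assumptions added for illustration and are not part of the answer above: they show one possible way to also allow an optional http:// or https:// prefix and more than one path segment.

```r
# Sample input from the question
urldoc <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"

# Pattern from the answer: optional whitespace, then token/token
cleaned <- gsub("\\s*[^[:space:]/]+/[^[:space:]/]+", "", urldoc)
print(cleaned)

# Hypothetical extension (not from the answer): optional scheme,
# then a host, then one or more /segment parts and an optional trailing slash
scheme_pattern <- "\\s*(https?://)?[^[:space:]/]+(/[^[:space:]/]+)+/?"
with_scheme <- "See https://example.com/a/b/ and goo.gl/YmNELW for details"
cleaned2 <- gsub(scheme_pattern, "", with_scheme)
print(cleaned2)
```

Note that both patterns rely on the slash, so a bare domain with no path (for example, example.com on its own) would be left in place.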

Wiktor Stribiżew answered Mar 27 '26 22:03


I see this has already been answered, but here is an alternative in case you haven't come across stringi before:

# stringi: a comprehensive package for string manipulation
library(stringi)

# text and regex
text <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"
# a whitespace char, then a token containing a dot that is
# immediately followed by at least one non-whitespace char
pattern <- "(?:\\s)[^\\s\\.]*\\.[^\\s]+"

# see what is captured
stringi::stri_extract_all_regex(text, pattern)

# remove (replace with "")
stringi::stri_replace_all_regex(text, pattern, "")
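
If stringi is not installed, the same two steps can probably be reproduced in base R. This is only a sketch under the assumption that the pattern behaves the same under PCRE (perl = TRUE) as it does under stringi's ICU engine; it does so for this sample text, but that is not guaranteed for every input. The variable names matches and cleaned are introduced here for illustration.

```r
# Same sample text and pattern as in the stringi example
text <- "A disgrace to his profession. pic.twitter.com/ZH08wej40K In a major victory for religious liberty, the Admin. has eviscerated institution continuing this path. goo.gl/YmNELW nothing like the admin. proposal: tinyurl.com/ydyzzlkk"
pattern <- "(?:\\s)[^\\s\\.]*\\.[^\\s]+"

# see what is captured (perl = TRUE selects the PCRE engine,
# which accepts \s inside a character class)
matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
print(matches)

# remove (replace with "")
cleaned <- gsub(pattern, "", text, perl = TRUE)
print(cleaned)
```

Because this pattern keys on a dot followed by a non-whitespace character rather than on a slash, it would also remove tokens such as decimal numbers, and it requires a leading whitespace character, so a URL at the very start of the text is not matched.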

Jonny Phelps answered Mar 27 '26 22:03


