 

Remove the end of a URL string in R

Tags: regex, r

I am trying to clean a list of URLs in R, keeping only the domain and removing the directories.

What I have:

http://domain.com/123
http://www.sub.domain1.com/222
http://www.domain2.com/1233/abc

What I want:

domain.com
sub.domain1.com
domain2.com

I have a slightly long way to clean the beginning of the URL

url <- c("http://domain.com/123", "http://www.sub.domain1.com/222", "http://www.domain2.com/1233/abc")

cleanurl <- gsub("http://","",url)
cleanurl2 <- gsub("www.","",cleanurl)

(Please let me know if there is a simpler way to clean the http:// and www. too.)

Now I am having problems with the regex and removing everything after the / at the end. I've tried this

cleanurl3 <- gsub("/*","",cleanurl2)

But it is just removing the / and not everything after it.

Thanks in advance for your help!

asked Mar 24 '13 19:03 by NicoM


2 Answers

I'd approach this with a strsplit/gsub combo (not just gsub, because strsplit is very intuitive and often quicker to figure out):

x <- readLines(n=3)
http://domain.com/123
http://www.sub.domain1.com/222
http://www.domain2.com/1233/abc

# "^www[.]" anchors the match and treats the dot literally, so only a leading "www." is removed
gsub("^www[.]", "", sapply(strsplit(x, "//|/"), "[", 2))

## [1] "domain.com"      "sub.domain1.com" "domain2.com"

Edit
Or if you want to just use strsplit (per Matthew's suggestion):

sapply(strsplit(x, "(//|/)(www[.])?"), "[", 2)
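
To see why the second piece is the one to keep, it may help to look at what strsplit returns before indexing. The following is only a sketch: it defines the sample URLs inline (instead of using readLines) so that it runs as a script, and the names `parts`, `hosts`, and `domains` are made up for illustration.

```r
# Sample URLs from the question, defined inline so the script is self-contained
x <- c("http://domain.com/123",
       "http://www.sub.domain1.com/222",
       "http://www.domain2.com/1233/abc")

# Split each URL on "//" or "/"; the scheme is piece 1 and the host is piece 2
parts <- strsplit(x, "//|/")
parts[[3]]
## [1] "http:"           "www.domain2.com" "1233"            "abc"

# Take the second piece of every URL (vapply checks that each result is one string)
hosts <- vapply(parts, `[`, character(1), 2)

# Drop a leading "www." only; the dot is matched literally
domains <- sub("^www[.]", "", hosts)
domains
## [1] "domain.com"      "sub.domain1.com" "domain2.com"
```

This relies on every URL starting with a scheme followed by "//"; a URL without one would put the host in the first piece instead.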
answered Oct 15 '22 07:10 by Tyler Rinker


For the first part (a simpler way to strip http:// and www.):

cleanurl <- sub("^http://(?:www[.])?(.*)$", "\\1", url)
cleanurl
## [1] "domain.com/123"       "sub.domain1.com/222"  "domain2.com/1233/abc"

Just the domains:

cleanurl <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", url)
cleanurl
## [1] "domain.com"      "sub.domain1.com" "domain2.com" 
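
As for why the attempt in the question only removed the slash: in a regex, "/*" means "zero or more slashes", not "a slash followed by anything". A rough sketch of the difference is below. The `https?` variant at the end is an assumption on my part (the question only has http:// URLs), and `domains` is just an illustrative name.

```r
url <- c("http://domain.com/123",
         "http://www.sub.domain1.com/222",
         "http://www.domain2.com/1233/abc")

# "/*" matches runs of zero or more "/" characters, so only the slashes are deleted
gsub("/*", "", "domain.com/123")
## [1] "domain.com123"

# "/.*$" matches the first slash and everything after it up to the end of the string
sub("/.*$", "", "domain.com/123")
## [1] "domain.com"

# One pass over the full URLs, also tolerating https:// (an assumption, not in the question).
# Group 1 is the optional "www." prefix; group 2 is everything up to the first "/".
domains <- sub("^https?://(www[.])?([^/]*).*$", "\\2", url)
domains
## [1] "domain.com"      "sub.domain1.com" "domain2.com"
```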
answered Oct 15 '22 06:10 by Matthew Lundberg