I am trying to clean a list of URLs in R and remove the directories from them.
What I have:
http://domain.com/123
http://www.sub.domain1.com/222
http://www.domain2.com/1233/abc
What I want:
domain.com
sub.domain1.com
domain2.com
I have a slightly long way to clean the beginning of the URL
url <- c("http://domain.com/123", "http://www.sub.domain1.com/222", "http://www.domain2.com/1233/abc")
cleanurl <- gsub("http://","",url)
cleanurl2 <- gsub("www.","",cleanurl)
(Please let me know if there is a simpler way to clean the http:// and www. too.)
Now I am having problems with the regex for removing everything after the / at the end.
I've tried this
cleanurl3 <- gsub("/*","",cleanurl2)
But it is just removing the / and not everything after it.
Thanks in advance for your help!
Here's an approach with a strsplit/gsub combo (not just gsub, because strsplit is often quick to figure out, as it is very intuitive):
x <- readLines(n=3)
http://domain.com/123
http://www.sub.domain1.com/222
http://www.domain2.com/1233/abc
gsub("^www[.]", "", sapply(strsplit(x, "//|/"), "[", 2))
## > gsub("^www[.]", "", sapply(strsplit(x, "//|/"), "[", 2))
## [1] "domain.com" "sub.domain1.com" "domain2.com"
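To see what the split is doing, here is a minimal sketch that breaks the one-liner into steps. The names parts and hosts are only illustrative, and x is built with c() instead of readLines() so the snippet runs non-interactively.

```r
# Same three URLs as above, defined inline rather than read from stdin
x <- c("http://domain.com/123",
       "http://www.sub.domain1.com/222",
       "http://www.domain2.com/1233/abc")

# Splitting on "//" or "/" gives, per URL: the scheme, the host, then path segments
parts <- strsplit(x, "//|/")
parts[[3]]

# The host is always the second piece of each split
hosts <- sapply(parts, "[", 2)

# Drop a leading "www." only; the dot is escaped so it matches a literal dot
clean <- sub("^www[.]", "", hosts)
clean
```

Anchoring the pattern with ^ and writing the dot as [.] keeps the replacement from touching a "www" that happens to appear elsewhere in the host.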
Edit
Or if you want to just use strsplit (per Matthew's suggestion):
sapply(strsplit(x, "(//|/)(www[.])?"), "[", 2)
For the first part (stripping http:// and www.):
cleanurl <- sub("^http://(?:www[.])?(.*)$", "\\1", url)
cleanurl
## [1] "domain.com/123" "sub.domain1.com/222" "domain2.com/1233/abc"
Just the domains:
cleanurl <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", url)
cleanurl
## [1] "domain.com" "sub.domain1.com" "domain2.com"
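As for why the gsub("/*", ...) attempt in the question only removed the slash: in a regex, * applies to the preceding character, so "/*" means "zero or more slashes", not "a slash and everything after it". A small sketch of the difference follows. The names slashes_only and hosts are illustrative, and cleanurl2 is recreated here as it would be after the question's first two gsub calls.

```r
# State of the data after stripping "http://" and "www." as in the question
cleanurl2 <- c("domain.com/123", "sub.domain1.com/222", "domain2.com/1233/abc")

# "/*" matches runs of slashes (including empty runs), so only slashes are deleted
slashes_only <- gsub("/*", "", cleanurl2)
slashes_only

# "/.*" matches the first slash plus every character after it
hosts <- sub("/.*", "", cleanurl2)
hosts
```

Here sub is enough rather than gsub, since the first match already runs to the end of the string.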