I'm looking for a function that can extract the domain name from a URL in R. Is there any function in R similar to Python's tldextract? EDIT: Currently I'm using the approach below:
domain=substr(as.character("www.google.com"),
which(strsplit("www.google.com",'')[[1]]=='.')[1]+1, nchar("www.google.com"))
But I'm looking for a predefined function that would save the coding effort.
You can also use the relatively new urltools package:
library(urltools)
URLs <- c("http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r",
"http://www.talkstats.com/", "www.google.com")
suffix_extract(domain(URLs))
##                host subdomain        domain suffix
## 1 stackoverflow.com      <NA> stackoverflow    com
## 2 www.talkstats.com       www     talkstats    com
## 3    www.google.com       www        google    com
It's backed by Rcpp, so it's wicked fast (significantly more so than using built-in R apply functions).
I don't know of a function in a package that does this, and I don't think there's anything in the base install of R. Use a user-defined function and store it somewhere to source later, or make your own package with it.
x1 <- "http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r"
x2 <- "http://www.talkstats.com/"
x3 <- "www.google.com"
domain <- function(x) strsplit(gsub("http://|https://|www\\.", "", x), "/")[[c(1, 1)]]
domain(x3)
sapply(list(x1, x2, x3), domain)
## [1] "stackoverflow.com" "talkstats.com" "google.com"
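The helper above handles one URL at a time, and its pattern is not anchored, so a "www." that appears later in the string would also be removed. A minimal vectorized sketch in base R follows; the name extract_host is hypothetical, and it assumes the input has no port or user-info part:

```r
# Hypothetical helper (not from any package): reduce URLs to their host,
# without a leading "www.". Assumes no port ("host:8080") and no user info.
extract_host <- function(urls) {
  hosts <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", urls)  # drop a scheme such as http://
  hosts <- sub("[/?#].*$", "", hosts)                     # drop path, query and fragment
  hosts <- sub("^www\\.", "", hosts)                      # drop only a leading www.
  tolower(hosts)
}

urls <- c(
  "http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r",
  "http://www.talkstats.com/",
  "www.google.com"
)
extract_host(urls)
## [1] "stackoverflow.com" "talkstats.com"     "google.com"
```

Because sub() is already vectorized, no sapply() call is needed.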
I just wrote this regex which can be applied to emails and websites for extracting and matching on domain. The regex could be modified to extract different parts, and it is vectorized. I do some additional processing which is totally optional.
invalid_domains = "yahoo.com|aol.com|gmail.com|hotmail.com|comcast.net|me.com|att.net|verizon.net|live.com|sbcglobal.net|msn.com|outlook.com|ibm.com"
domain <- function(x){
  to_return <- tolower(as.character(x))
  to_return <- gsub('.*[.@/]+([^.@:/]+[.][a-z]+)(/.*$|$)', '\\1', to_return, ignore.case = TRUE) # extract domain. \\1 selects just the first group.
  to_return <- gsub("[.]ocm$", ".com", to_return) # correct misspellings such as ".ocm"
  # to_return <- ifelse(grepl(invalid_domains, to_return), NA, to_return) # (optional) check for invalid domains, especially when working with emails.
  to_return <- ifelse(grepl('[.]', to_return), to_return, NA) # if there is no . this is an invalid domain, return NA
  return(to_return)
}
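One caveat about the optional filter: used as a grepl() pattern, a string like "me.com" also matches inside longer names such as "acme.com", and the unescaped dots match any character. A hedged sketch of an exact-match alternative is below; free_mail_domains and drop_free_mail are hypothetical names, and the list is only a shortened sample of the one above:

```r
# Hypothetical sample of providers to exclude (shortened for illustration).
free_mail_domains <- c("yahoo.com", "aol.com", "gmail.com", "hotmail.com", "me.com")

# Set domains that exactly equal a listed provider to NA; keep everything else.
drop_free_mail <- function(domains, blocked = free_mail_domains) {
  ifelse(domains %in% blocked, NA_character_, domains)
}

grepl("me.com", "acme.com")   # substring match, so a legitimate domain would be dropped
## [1] TRUE
drop_free_mail(c("gmail.com", "acme.com", "example.org"))
## [1] NA            "acme.com"    "example.org"
```

Matching with %in% compares whole strings, so only exact provider names are filtered.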
A vectorized option that uses only base R, and whose output is easy to customize, could be:
url_regexpr <- function() {
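protocol <- "([^/]+://)*" # optional: scheme such as https://
sub <- "([^\\.\\?/]+\\.)*" # optional: one or more subdomains
domain <- "([^\\.\\?/]+)" # required: the domain label
dot <- "(\\.)" # required: dot before the suffix
suffix <- "([^/]+)" # required: suffix such as com
folders <- "(/[^\\?]*)*" # optional: path
args <- "(\\?.*)*" # optional: query string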
paste0(
"^",
protocol, sub, domain, dot, suffix, folders, args,
"$"
)
}
get_domain <- function(url, include_suffix = TRUE) {
res <- paste0("\\3", c("\\4\\5")[include_suffix])
sub(url_regexpr(), res, url)
}
I have run the following tests on it:
library(testthat)
test_that("get_domain works", {
expect_equal(get_domain("https://www.example.com"), "example.com")
expect_equal(get_domain("http://www.example.com"), "example.com")
expect_equal(get_domain("www.example.com"), "example.com")
expect_equal(get_domain("www.example.net"), "example.net")
expect_equal(get_domain("www.example.net/baz"), "example.net")
expect_equal(get_domain("https://www.example.net/baz"), "example.net")
expect_equal(get_domain("https://www.example.net/baz/tar"), "example.net")
expect_equal(get_domain("https://foo.example.net"), "example.net")
expect_equal(get_domain("https://www.foo.example.net"), "example.net")
})
test_that("get_domain is vectorized", {
urls <- c("www.example.com", "www.example.net")
expect_equal(get_domain(urls), c("example.com", "example.net"))
})
test_that("can remove suffix", {
expect_equal(
get_domain("https://www.example.com", include_suffix = FALSE),
"example"
)
})
test_that("works with file extensions", {
expect_equal(
get_domain("https://www.example.com/foo.php"),
"example.com"
)
})
test_that("works with trailing slash", {
expect_equal(
get_domain("http://m.example.com/"),
"example.com"
)
})
test_that("works against args after slash", {
expect_equal(
get_domain("http://example.com/?"),
"example.com"
)
})
test_that("works against multiple dots after slash", {
expect_equal(
get_domain("http://example.com/foo.net.bar"),
"example.com"
)
})
test_that("generalized protocols", {
expect_equal(
get_domain("android-app://example.com/foo.net.bar"),
"example.com"
)
})
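The tests above only use single-label suffixes such as com and net. A pattern of this shape cannot tell that "co.uk" is itself a suffix, which is the problem suffix lists solve. A minimal base R sketch of the idea follows; registered_domain and known_suffixes are hypothetical, and the tiny hard-coded list merely stands in for a real public suffix list:

```r
# Hypothetical, deliberately tiny suffix list; a real one would be far longer.
known_suffixes <- c("co.uk", "com.au", "com", "net", "org", "uk", "au")

# Return "<last label before the suffix>.<suffix>", or NA if no suffix matches.
registered_domain <- function(hosts, suffixes = known_suffixes) {
  # Try longer suffixes first so that "co.uk" wins over "uk".
  suffixes <- suffixes[order(nchar(suffixes), decreasing = TRUE)]
  vapply(tolower(hosts), function(host) {
    for (s in suffixes) {
      dotted <- paste0(".", s)
      if (endsWith(host, dotted)) {
        # Text before ".<suffix>"; keep only its last dot-separated label.
        prefix <- substr(host, 1, nchar(host) - nchar(dotted))
        labels <- strsplit(prefix, ".", fixed = TRUE)[[1]]
        return(paste0(labels[length(labels)], dotted))
      }
    }
    NA_character_
  }, character(1), USE.NAMES = FALSE)
}

registered_domain(c("www.example.co.uk", "foo.example.net", "localhost"))
## [1] "example.co.uk" "example.net"   NA
```

This expects bare host names, so strip the protocol and path first.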