Function to extract domain name from URL in R

Tags:

r

I'm looking for a function that can extract the domain name from a URL in R. Is there anything in R similar to tldextract? EDIT: Currently I'm using the approach below:

domain=substr(as.character("www.google.com"), 
   which(strsplit("www.google.com",'')[[1]]=='.')[1]+1, nchar("www.google.com"))

But I'm looking for a predefined function that would save the coding effort.

asked Sep 26 '13 06:09

Prasun Velayudhan

4 Answers

You can also use the relatively new urltools package:

library(urltools)

URLs <- c("http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r",
          "http://www.talkstats.com/", "www.google.com")

suffix_extract(domain(URLs))

##                host subdomain        domain suffix
## 1 stackoverflow.com      <NA> stackoverflow    com
## 2 www.talkstats.com       www     talkstats    com
## 3    www.google.com       www        google    com

It's backed by Rcpp, so it's wicked fast (significantly more so than using built-in R apply functions).
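If only the registered domain (for example `google.com`) is wanted rather than the whole data frame, the `domain` and `suffix` columns shown above can be pasted back together. This is a minimal sketch: the `requireNamespace()` guard and the variable names `parts` and `registered` are my own additions, not part of the answer.

```r
# Sketch only: runs when the urltools package is installed.
if (requireNamespace("urltools", quietly = TRUE)) {
  URLs <- c("http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r",
            "http://www.talkstats.com/", "www.google.com")

  # One row per URL, with columns host, subdomain, domain and suffix.
  parts <- urltools::suffix_extract(urltools::domain(URLs))

  # Glue domain and suffix back together (hypothetical variable name).
  registered <- paste(parts$domain, parts$suffix, sep = ".")
  print(registered)
}
```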

answered Oct 16 '22 10:10

hrbrmstr

I don't know of a function in a package that does this, and I don't think there's anything in the base install of R. Use a user-defined function and store it somewhere to source later, or make your own package with it.

x1 <- "http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r"
x2 <- "http://www.talkstats.com/"
x3 <- "www.google.com"

domain <- function(x) strsplit(gsub("http://|https://|www\\.", "", x), "/")[[c(1, 1)]]

domain(x3)
sapply(list(x1, x2, x3), domain)
## [1] "stackoverflow.com" "talkstats.com"     "google.com"
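Note that `domain()` as written returns a result for a single URL only (hence the `sapply` call) and removes `www.` wherever it occurs in the string. A possible vectorized variant is sketched below. It assumes that only a leading `http://` or `https://` and a leading `www.` should be stripped; `domain_vec` is a hypothetical name, not a function from any package.

```r
# Hypothetical helper: vectorized over a character vector of URLs.
domain_vec <- function(x) {
  # Drop a leading scheme (http:// or https://) and a leading "www.".
  host <- sub("^(https?://)?(www\\.)?", "", x)
  # Drop everything from the first "/" onward (the path).
  sub("/.*$", "", host)
}

urls <- c(
  "http://stackoverflow.com/questions/19020749/function-to-extract-domain-name-from-url-in-r",
  "http://www.talkstats.com/",
  "www.google.com"
)
domain_vec(urls)
## [1] "stackoverflow.com" "talkstats.com"     "google.com"
```

Because both `sub()` calls are vectorized, no `sapply` is needed here.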

answered Oct 16 '22 09:10

Tyler Rinker

I just wrote this regex, which can be applied to emails and websites to extract and match on the domain. The regex could be modified to extract different parts, and it is vectorized. I do some additional processing, which is entirely optional.

invalid_domains = "yahoo.com|aol.com|gmail.com|hotmail.com|comcast.net|me.com|att.net|verizon.net|live.com|sbcglobal.net|msn.com|outlook.com|ibm.com"
domain <- function(x){
  to_return <- tolower(as.character(x))
  to_return <- gsub('.*[.@/]+([^.@:/]+[.][a-z]+)(/.*$|$)', '\\1', to_return, ignore.case = TRUE) # extract domain. \\1 selects just the first group.
  to_return <- gsub("[.]ocm$", ".com", to_return) # correct misspellings
  # to_return <- ifelse(grepl(invalid_domains,to_return),NA,to_return) # (optional) check for invalid domains, especially when working with emails.
  to_return <- ifelse(grepl('[.]',to_return),to_return,NA) # if there is no . this is an invalid domain, return NA
  return(to_return)
}
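To illustrate how the pattern behaves, here is a short sketch that applies the same regex directly to a mix of email addresses and URLs, followed by the "no dot means invalid" check from the function. The sample inputs and the names `pattern`, `inputs` and `out` are made up for this example.

```r
# Same regex as in the answer; group 1 captures "name.suffix".
pattern <- ".*[.@/]+([^.@:/]+[.][a-z]+)(/.*$|$)"

# Made-up inputs: an email address, a URL with a path, and a bare host name.
inputs <- c("jane.doe@Example.com", "https://www.example.org/about", "localhost")

# Lower-case first, then keep only the captured group.
out <- gsub(pattern, "\\1", tolower(inputs))

# Anything without a dot is treated as an invalid domain and becomes NA.
out <- ifelse(grepl("[.]", out), out, NA_character_)
out
## [1] "example.com" "example.org" NA
```

Inputs that the pattern does not match, such as `localhost`, pass through `gsub()` unchanged, which is why the final check is needed.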

answered Oct 16 '22 08:10

Bryce Chamberlain

A vectorized option that uses only base R, and whose output is easy to customize, could be:

url_regexpr <- function() {
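  protocol <- "([^/]+://)*"  # optional: scheme such as https://
  sub <- "([^\\.\\?/]+\\.)*"  # optional: subdomains such as www.
  domain <- "([^\\.\\?/]+)"  # required: the domain name itself
  dot <- "(\\.)"  # required: dot between domain and suffix
  suffix <- "([^/]+)"  # required: suffix such as com
  folders <- "(/[^\\?]*)*"  # optional: path after the host
  args <- "(\\?.*)*"  # optional: query string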

  paste0(
    "^",
    protocol, sub, domain, dot, suffix, folders, args,
    "$"
  )
}

get_domain <- function(url, include_suffix = TRUE) {
  res <- paste0("\\3", c("\\4\\5")[include_suffix])
  sub(url_regexpr(), res, url)
}

I have run the following tests on it:

library(testthat)
test_that("get_domain works", {
  expect_equal(get_domain("https://www.example.com"), "example.com")
  expect_equal(get_domain("http://www.example.com"), "example.com")
  expect_equal(get_domain("www.example.com"), "example.com")

  expect_equal(get_domain("www.example.net"), "example.net")
  expect_equal(get_domain("www.example.net/baz"), "example.net")

  expect_equal(get_domain("https://www.example.net/baz"), "example.net")
  expect_equal(get_domain("https://www.example.net/baz/tar"), "example.net")

  expect_equal(get_domain("https://foo.example.net"), "example.net")
  expect_equal(get_domain("https://www.foo.example.net"), "example.net")
})

test_that("get_domain is vectorized", {
  urls <- c("www.example.com", "www.example.net")
  expect_equal(get_domain(urls), c("example.com", "example.net"))
})


test_that("can remove suffix", {
  expect_equal(
    get_domain("https://www.example.com", include_suffix = FALSE),
    "example"
  )
})

test_that("works with file extensions", {
  expect_equal(
    get_domain("https://www.example.com/foo.php"),
    "example.com"
  )
})

test_that("works with trailing slash", {
  expect_equal(
    get_domain("http://m.example.com/"),
    "example.com"
  )
})

test_that("works against args after slash", {
  expect_equal(
    get_domain("http://example.com/?"),
    "example.com"
  )
})

test_that("works against multiple dots after slash", {
  expect_equal(
    get_domain("http://example.com/foo.net.bar"),
    "example.com"
  )
})

test_that("generalized protocols", {
  expect_equal(
    get_domain("android-app://example.com/foo.net.bar"),
    "example.com"
  )
})
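The same capture groups can be reused to pull out other parts, which is what makes the output easy to customize. The sketch below writes the pattern from `url_regexpr()` as a single string and collects the domain and the suffix into a data frame. The name `parts` and the sample URLs are assumptions for illustration only.

```r
# The pattern built by url_regexpr(), written out as one string.
# Groups: 1 protocol, 2 subdomain, 3 domain, 4 dot, 5 suffix, 6 folders, 7 args.
pattern <- "^([^/]+://)*([^\\.\\?/]+\\.)*([^\\.\\?/]+)(\\.)([^/]+)(/[^\\?]*)*(\\?.*)*$"

urls <- c("https://www.example.com/foo.php", "http://m.example.net/")

# Pick different groups in the replacement to get different parts.
parts <- data.frame(
  url    = urls,
  domain = sub(pattern, "\\3", urls),
  suffix = sub(pattern, "\\5", urls),
  stringsAsFactors = FALSE
)
parts
```

Note that the pattern treats everything after the last dot of the host as the suffix, so multi-part suffixes such as `co.uk` are not recognized as a unit.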

answered Oct 16 '22 10:10

Corrado
