
How to extract the "domain" from an email address

Tags: regex, r

I have the following pattern in my column:

[email protected]
[email protected]

Now I want to extract the text after the @ and before the first ., i.e. gmail and hotmail. I am able to extract the text after the @ with the following code:

sub(".*@", "", email)

How can I modify the above to fit my use case?

Neil asked Oct 14 '16 08:10


2 Answers

You:

  1. really need to read Section 3 of RFC 3696 (TL;DR: the @ can appear in more than one place in an address)
  2. seem not to have considered that an email address can look like "[email protected]" or "[email protected]" (i.e. naively assuming that only a simple domain follows the @ could come back to bite you at some point in this analysis)
  3. should be aware that if you're really looking for the email "domain name", then you also have to consider what actually constitutes a domain name and a proper suffix.
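
As a minimal base R sketch of the first point: the addresses below are hypothetical (they are not from the question), and the second one has a quoted local part that itself contains an @. Splitting at the first @ leaves part of the local name behind, while splitting at the last @ isolates the host.

```r
# Hypothetical addresses for illustration only; the second has a quoted
# local part that contains "@".
emails <- c("alice@gmail.com", "\"odd@local\"@example.co.uk")

# Non-greedy match: strips only up to the FIRST "@"
first_at <- sub("^.*?@", "", emails, perl = TRUE)

# Greedy match: strips up to the LAST "@"
last_at <- sub("^.*@", "", emails)

first_at
## [1] "gmail.com"              "local\"@example.co.uk"
last_at
## [1] "gmail.com"     "example.co.uk"
```

Note that the greedy pattern used in the question, sub(".*@", "", email), already behaves like the last-@ version.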

So, unless you know for sure that you have and always will have simple email addresses, might I suggest:

library(stringi)
library(urltools)
library(dplyr)
library(purrr)

emails <- c("[email protected]", "[email protected]",
            "[email protected]",
            "[email protected]",
            "[email protected]")

stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_df(emails, function(x, y) {
    substr(y, x+1, nchar(y)) %>%
      suffix_extract()
  })
##                         host    subdomain      domain suffix
## 1                  gmail.com         <NA>       gmail    com
## 2                hotmail.com         <NA>     hotmail    com
## 3     department.example.com   department     example    com
## 4 yet.another.department.com  yet.another  department    com
## 5            froodyorg.co.uk         <NA>   froodyorg  co.uk

Note the proper splitting of subdomain, domain & suffix, especially for the last one.
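
To see why the suffix needs a lookup rather than "everything after the last dot", here is a rough base R sketch. The toy_suffixes vector and the split_host helper are hypothetical names invented for this illustration. This is not how urltools implements suffix_extract, and a real suffix list is far longer than three entries.

```r
# Toy suffix list, assumed for illustration only.
toy_suffixes <- c("co.uk", "com", "uk")

# Hypothetical helper: split a host into subdomain, domain and suffix
# by matching the longest known suffix first.
split_host <- function(host, suffixes = toy_suffixes) {
  # Try longer suffixes first so "co.uk" wins over "uk".
  suffixes <- suffixes[order(nchar(suffixes), decreasing = TRUE)]
  for (s in suffixes) {
    if (endsWith(host, paste0(".", s))) {
      # Drop the suffix and the dot in front of it.
      rest <- substr(host, 1, nchar(host) - nchar(s) - 1)
      labels <- strsplit(rest, ".", fixed = TRUE)[[1]]
      n <- length(labels)
      return(list(
        subdomain = if (n > 1) paste(labels[-n], collapse = ".") else NA_character_,
        domain = labels[n],
        suffix = s
      ))
    }
  }
  # No known suffix matched.
  list(subdomain = NA_character_, domain = NA_character_, suffix = NA_character_)
}

str(split_host("mail.example.co.uk"))
str(split_host("gmail.com"))
```

With this toy list, "mail.example.co.uk" splits into subdomain "mail", domain "example" and suffix "co.uk", whereas a plain split on the last dot would report "co" as the domain.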

Knowing this, we can then change the code to:

stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_chr(emails, function(x, y) {
    substr(y, x+1, nchar(y)) %>%
      suffix_extract() %>%
      mutate(full_domain=ifelse(is.na(subdomain), domain, sprintf("%s.%s", subdomain, domain))) %>%
      select(full_domain) %>%
      flatten_chr()
  })
## [1] "gmail"                   "hotmail"               
## [3] "department.example"      "yet.another.department"
## [5] "froodyorg"
hrbrmstr answered Sep 18 '22 02:09


We can use gsub to remove everything up to the @ and everything from the first . after it:

gsub(".*@|\\..*", "", email)
#[1] "gmail"   "hotmail"
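
As a small sketch of how this pattern behaves, the addresses below are hypothetical stand-ins (the original ones are not shown). An equivalent capture-group form is included for comparison. Both return only the first label after the @, so a host with a subdomain yields the subdomain rather than the registered domain.

```r
# Hypothetical addresses for illustration only.
email <- c("alice@gmail.com", "bob@hotmail.com", "carol@mail.example.co.uk")

# Strip through the last "@", then strip from the first "." onward.
first_label <- gsub(".*@|\\..*", "", email)

# Same result with a capture group: keep the run of non-dot
# characters that follows the last "@".
first_label2 <- sub(".*@([^.]+).*", "\\1", email)

first_label
## [1] "gmail"   "hotmail" "mail"
identical(first_label, first_label2)
## [1] TRUE
```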
akrun answered Sep 17 '22 02:09