 

Extracting top-level and second-level domain from a URL using regex

Tags: regex, url, dns

How can I extract only the top-level and second-level domain from a URL using regex? I want to skip all lower-level domains. Any ideas?

asked Jan 16 '14 21:01 by mel




3 Answers

Here's my idea:

Match up to three dot-separated labels (runs of characters that aren't dots), anchored to the end of the string with $.

The final label is optional, to allow for two-part suffixes such as .com.au or .co.nz.

The last two labels are each limited to 2-3 characters, so that a longer second-level domain name isn't mistaken for part of the suffix.


Regex:

[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$


Demonstration:

Regex101 Example
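To see how this pattern behaves, here is a small sketch in JavaScript (the language used elsewhere on this page). The helper name `guessDomain` and the sample URLs are assumptions made up for illustration. Because `[^.]` also matches `/` and `:`, the sketch first reduces each URL to its hostname with the standard `URL` class.

```javascript
// The regex from this answer, unchanged.
const DOMAIN_RE = /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/;

// Hypothetical helper: returns the matched domain, or null if nothing matches.
function guessDomain(url) {
  const host = new URL(url).hostname; // drop scheme, port and path
  const match = host.match(DOMAIN_RE);
  return match ? match[0] : null;
}

console.log(guessDomain('https://www.example.com/path')); // 'example.com'
console.log(guessDomain('https://shop.example.co.uk/'));  // 'example.co.uk'
console.log(guessDomain('https://www.abc.com/'));         // 'www.abc.com'
console.log(guessDomain('https://www.example.info/'));    // null
```

The last two calls show the limits of the 2-3 character rule: a three-letter second-level name is read as part of the suffix, and a TLD longer than three characters does not match at all.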

answered Oct 08 '22 19:10 by Vasili Syrakis


Updated 2019

This is an old question, and the problem has become a lot more complicated with the addition of new vanity TLDs and more ccTLD second-level domains (e.g. .co.uk, .org.uk). So much so that a regular expression is almost guaranteed to return false positives or negatives.

The only way to reliably get the primary host is to consult a data source that knows about these suffixes, like the Public Suffix List.

There are several open-source libraries out there that you can use, like psl, or you can write your own.

Usage for psl is quite intuitive. From their docs:

var psl = require('psl');

// Parse domain without subdomain
var parsed = psl.parse('google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // null

// Parse domain with subdomain
var parsed = psl.parse('www.google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // 'www'

// Parse domain with nested subdomains
var parsed = psl.parse('a.b.c.d.foo.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'foo'
console.log(parsed.domain); // 'foo.com'
console.log(parsed.subdomain); // 'a.b.c.d'
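For the "write your own" route mentioned above, the core idea can be sketched as a longest-suffix lookup. This is only an illustration, not how psl is implemented: `SAMPLE_SUFFIXES` is a tiny hand-picked stand-in for the real Public Suffix List, `registrableDomain` is a hypothetical name, and the real list's wildcard and exception rules are left out.

```javascript
// Assumption: a tiny hand-picked subset standing in for the real Public Suffix List.
const SAMPLE_SUFFIXES = new Set(['com', 'org', 'uk', 'co.uk', 'org.uk', 'au', 'com.au']);

// Hypothetical helper: returns the public suffix plus one label to its left,
// or null when the hostname is itself a suffix or no known suffix is found.
function registrableDomain(hostname, suffixes = SAMPLE_SUFFIXES) {
  const labels = hostname.toLowerCase().split('.');
  // Try the longest candidate suffix first, so 'co.uk' wins over 'uk'.
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join('.');
    if (suffixes.has(candidate)) {
      // i === 0 means the whole hostname is a suffix, so nothing is registrable.
      return i === 0 ? null : labels.slice(i - 1).join('.');
    }
  }
  return null;
}

console.log(registrableDomain('a.b.c.d.foo.com')); // 'foo.com'
console.log(registrableDomain('www.bbc.co.uk'));   // 'bbc.co.uk'
console.log(registrableDomain('co.uk'));           // null
```

A lookup like this avoids guessing from label lengths, but it is only as good as the suffix data behind it, which is why a maintained library is usually the safer choice.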

Old answer

You could use this:

(\w+\.\w+)$

Without more details (a sample file, the language you're using), it's hard to say whether this will work for your input.

Example: http://regex101.com/r/wD8eP2
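As a rough check of what this older pattern captures, here is a sketch in JavaScript. The helper name `lastTwoLabels` and the sample URLs are assumptions for illustration, and the hostname is extracted with the standard `URL` class first, since the pattern is anchored to the end of the string and a trailing path would get in the way.

```javascript
// The regex from the old answer, unchanged.
const LAST_TWO_LABELS = /(\w+\.\w+)$/;

// Hypothetical helper: returns the captured text, or null if nothing matches.
function lastTwoLabels(url) {
  const host = new URL(url).hostname; // drop scheme, port and path
  const match = host.match(LAST_TWO_LABELS);
  return match ? match[1] : null;
}

console.log(lastTwoLabels('http://www.google.com/search')); // 'google.com'
console.log(lastTwoLabels('http://www.bbc.co.uk/'));        // 'co.uk'
console.log(lastTwoLabels('http://www.my-site.com/'));      // 'site.com'
```

The second and third calls illustrate the false results described in the update: two-part suffixes are cut short, and `\w` does not match hyphens, so hyphenated names are truncated.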

answered Oct 08 '22 19:10 by brandonscript


You can likely also do this with an expression similar to:

^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$

and add as many capturing groups as you need to capture the components of a URL.

Demo
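A short JavaScript sketch of how the capturing groups in this expression could be read. The helper name `secondGroup` and the sample URLs are assumptions for illustration. Note that the pattern requires an http or https scheme and at least one label in front of the domain.

```javascript
// The expression from this answer, unchanged, as a JavaScript regex literal.
const URL_RE = /^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$/;

// Hypothetical helper: returns capturing group 2 (the intended domain), or null.
function secondGroup(url) {
  const match = url.match(URL_RE);
  return match ? match[2] : null;
}

console.log(secondGroup('https://www.example.com/path')); // 'example.com'
console.log(secondGroup('https://blog.example.co.uk/'));  // 'example.co.uk'
console.log(secondGroup('https://example.com/'));         // null
console.log(secondGroup('https://www.example.co.uk/'));   // 'co.uk'
```

The last two calls suggest some caution: a URL without a subdomain does not match at all, and because the optional `www.` prefix is tried first, a `www` host under a two-part suffix yields only the suffix.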


If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. The linked demo also shows how it matches against some sample inputs.


RegEx Circuit

jex.im visualizes regular expressions.

answered Oct 08 '22 21:10 by Emma