Extracting top-level and second-level domain from a URL using regex

Question

How can I extract only top-level and second-level domain from a URL using regex? I want to skip all lower level domains. Any ideas?

Vasili Syrakis · Accepted Answer

Here's my idea:

Match a run of characters that aren't dots, up to three times, anchored to the end of the line with $.

The last of these groups is optional, which allows for .com.au or .co.nz type domains.

The last and second-to-last groups match only 2-3 characters each, so that a longer second-level domain name isn't mistaken for part of the suffix.

Regex:

[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$

Demonstration:

Regex101 Example
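
To see how this pattern behaves outside regex101, here is a minimal sketch in JavaScript. It assumes the input is a full URL and that the hostname is pulled out first with the standard URL class, since the $ anchor only makes sense at the end of a hostname. The helper name extractDomain is made up for illustration.

```javascript
// Regex from the answer above, applied to a hostname rather than a full URL.
const DOMAIN_RE = /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/;

// Hypothetical helper: returns the matched domain, or null if nothing matches.
function extractDomain(url) {
  const host = new URL(url).hostname; // drops scheme, port, path and query
  const match = host.match(DOMAIN_RE);
  return match ? match[0] : null;
}

console.log(extractDomain('https://www.example.com/page')); // 'example.com'
console.log(extractDomain('http://a.b.example.co.nz/'));    // 'example.co.nz'
// A 2-3 character second-level label is mistaken for part of the suffix:
console.log(extractDomain('https://www.bbc.com/news'));     // 'www.bbc.com'
```

The last call shows the main limitation of this approach: the pattern cannot tell a short domain name such as bbc from a suffix component such as co.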

brandonscript · Answer

Updated 2019

This is an old question, and the problem has become a lot more complicated as new vanity TLDs and more ccTLD second-level domains (e.g. .co.uk, .org.uk) keep being added. So much so that a regular expression is almost guaranteed to return false positives or negatives.

The only way to reliably get the primary host is to consult a source that knows about these suffixes, like the Public Suffix List.

There are several open-source libraries out there that you can use, like psl, or you can write your own.

Usage for psl is quite intuitive. From their docs:

var psl = require('psl');

// Parse domain without subdomain
var parsed = psl.parse('google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // null

// Parse domain with subdomain
var parsed = psl.parse('www.google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // 'www'

// Parse domain with nested subdomains
var parsed = psl.parse('a.b.c.d.foo.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'foo'
console.log(parsed.domain); // 'foo.com'
console.log(parsed.subdomain); // 'a.b.c.d'
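
If you would rather write your own, the core idea can be sketched as a lookup against a set of known suffixes. The sketch below is only an illustration: the SUFFIXES set is a tiny hand-picked sample rather than the real Public Suffix List, the helper name registrableDomain is made up, and the list's wildcard and exception rules are ignored.

```javascript
// Tiny hand-picked sample of suffixes, for illustration only.
// A real implementation would load the full Public Suffix List.
const SUFFIXES = new Set([
  'com', 'org', 'net',
  'uk', 'co.uk', 'org.uk',
  'au', 'com.au',
  'nz', 'co.nz',
]);

// Hypothetical helper: returns the suffix plus one extra label,
// or null if no known suffix matches or the host is itself a suffix.
function registrableDomain(hostname) {
  const labels = hostname.toLowerCase().split('.');
  // Walk from the longest candidate suffix to the shortest,
  // so that 'co.uk' is preferred over 'uk'.
  for (let i = 0; i < labels.length; i++) {
    const suffix = labels.slice(i).join('.');
    if (SUFFIXES.has(suffix)) {
      return i === 0 ? null : labels.slice(i - 1).join('.');
    }
  }
  return null;
}

console.log(registrableDomain('a.b.c.d.foo.com')); // 'foo.com'
console.log(registrableDomain('www.bbc.co.uk'));   // 'bbc.co.uk'
console.log(registrableDomain('co.uk'));           // null
```

Because the decision is driven by data rather than by label length, short names such as bbc and multi-part suffixes such as co.uk are handled the same way as everything else.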

Old answer

You could use this:

(\w+\.\w+)$

Without more details (a sample file, the language you're using), it's hard to discern exactly whether this will work.

Example: http://regex101.com/r/wD8eP2
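
As a quick check of where this simpler pattern works and where it breaks, here is a small JavaScript sketch. It assumes the input is already a bare hostname; the helper name lastTwoLabels is made up for illustration.

```javascript
// Old-answer pattern: the last two dot-separated runs of word characters.
const LAST_TWO_LABELS = /(\w+\.\w+)$/;

// Hypothetical helper; expects a bare hostname, not a full URL.
function lastTwoLabels(hostname) {
  const match = hostname.match(LAST_TWO_LABELS);
  return match ? match[1] : null;
}

console.log(lastTwoLabels('www.google.com')); // 'google.com'
console.log(lastTwoLabels('www.bbc.co.uk'));  // 'co.uk' (the suffix, not the site)
console.log(lastTwoLabels('my-site.com'));    // 'site.com' (\w does not match '-')
```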

Emma · Answer

You can likely also do that with an expression similar to:

^(?:https?://)(?:w{3}\.)?.*?([^.\r\n/]+\.)([^.\r\n/]+\.[^.\r\n/]{2,6}(?:\.[^.\r\n/]{2,6})?).*$

and add as many capturing groups as you need to capture the components of a URL.
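
A minimal sketch of reading the capturing groups in JavaScript follows. It assumes the character classes in the expression exclude carriage return and line feed, escapes the slashes so the pattern can be written as a literal, and uses a made-up helper name, splitHost. Note that, as written, the expression only matches when at least one label precedes the domain.

```javascript
// Pattern from the answer, written as a JavaScript literal (slashes escaped).
const URL_RE = /^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$/;

// Hypothetical helper: returns the captured pieces, or null when the URL does not match.
function splitHost(url) {
  const m = url.match(URL_RE);
  return m ? { subdomain: m[1], domain: m[2] } : null;
}

console.log(splitHost('https://blog.example.com/path'));
// { subdomain: 'blog.', domain: 'example.com' }
console.log(splitHost('https://example.com')); // null (no label before the domain)
```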

Demo

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also see in this link how it matches against some sample inputs.

RegEx Circuit

jex.im visualizes regular expressions.

Tags: regex, url, dns, mel
