
What is the best way to normalize a URI to extract just the domain name?

Tags: c#, .net, uri

For example:

http://www.google.co.uk
www.google.co.uk
google.co.uk

will all be converted to:

google.co.uk

I would have liked to use the System.Uri class, but it only seems to accept URLs with a scheme.

jaffa asked Jan 16 '23 04:01


1 Answer

Extracting the domain name is easy

The UriBuilder class normalises URLs and handles many edge cases like a missing scheme. This makes it easy to extract the domain name. For example, these all give you www.google.co.uk:

new UriBuilder("www.google.co.uk").Host
new UriBuilder("http://www.google.co.uk").Host
new UriBuilder("ftp://www.google.co.uk:21/some/path").Host
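If you want that as a reusable helper, here is a minimal sketch. The class and method names (HostExtractor, GetHost) are made up for illustration; only UriBuilder and its Host property come from the framework. It relies on UriBuilder assuming http:// when the input has no scheme, and it does not handle null or malformed input, which would throw.

```csharp
using System;

public static class HostExtractor
{
    // Hypothetical helper: returns the host part of a URL,
    // whether or not the input includes a scheme.
    public static string GetHost(string url)
    {
        // UriBuilder prepends "http://" when the string is not an absolute URI.
        return new UriBuilder(url.Trim()).Host;
    }
}
```

Calling HostExtractor.GetHost("www.google.co.uk") and HostExtractor.GetHost("ftp://www.google.co.uk:21/some/path") should both return the same host, with the www. prefix still attached.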

...but removing www. is hard

The problem seems easy, but it's not. You can't reliably remove subdomains like www because nothing in the name itself marks them as different. The full host name is www.google.co.uk, including www. There's nothing special about co.uk that makes google part of the domain and www not part of it; it just happens that co.uk is managed by a registry, and google.co.uk is managed by Google.

To give you an idea of the problem, the (incomplete) list of known domain suffixes includes nearly 7100 entries so far. Notably, where the suffix ends isn't even consistent within a single top-level domain:

URL                     the domain you want
---------------------   -------------------
http://www.crews.aero   crews.aero
http://www.crew.aero    www.crew.aero

The best approach would be what Google itself does for Chrome's omnibox: fetch the (incomplete) list of domain suffixes, cache it temporarily, and compare domain names against that list. You can see the result for yourself: type "crews.aero" in the Chrome omnibox and it will be treated as a URL; type "crew.aero" and it will be treated as a search.
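The comparison step could look roughly like the sketch below. Everything in it is an assumption for illustration: the class name, the method name, and especially the five hard-coded suffixes, which stand in for the real list you would fetch and cache. It also ignores any wildcard or exception rules a real suffix list may contain.

```csharp
using System;
using System.Collections.Generic;

public static class RegistrableDomain
{
    // Hypothetical hand-picked sample; a real list has thousands of entries.
    private static readonly HashSet<string> SampleSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "com", "uk", "co.uk", "aero", "crew.aero"
        };

    // Returns the longest matching suffix plus one more label to its left.
    // Hosts with no known suffix are returned unchanged.
    public static string FromHost(string host)
    {
        string trimmed = host.TrimEnd('.');
        string[] labels = trimmed.Split('.');

        // Try candidate suffixes from longest to shortest, always leaving
        // at least one label in front of the suffix.
        for (int i = 1; i < labels.Length; i++)
        {
            string candidate = string.Join(".", labels, i, labels.Length - i);
            if (SampleSuffixes.Contains(candidate))
            {
                return string.Join(".", labels, i - 1, labels.Length - i + 1);
            }
        }
        return trimmed;
    }
}
```

With this sample set, www.google.co.uk and www.crews.aero lose their www, while www.crew.aero keeps it because crew.aero is itself listed as a suffix. Feed it the Host value from UriBuilder to cover inputs with or without a scheme.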

Pathoschild answered Jan 26 '23 01:01