For example:
http://www.google.co.uk
www.google.co.uk
google.co.uk
will all be converted to:
google.co.uk
I would have liked to use the System.Uri class, but it only seems to accept URLs with a scheme.
The UriBuilder class normalises URLs and handles many edge cases, such as a missing scheme. This makes it easy to extract the domain name. For example, these all give you www.google.co.uk:
new UriBuilder("www.google.co.uk").Host
new UriBuilder("http://www.google.co.uk").Host
new UriBuilder("ftp://www.google.co.uk:21/some/path").Host
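If the input comes from users, it may be worth wrapping the call so that unparseable strings do not throw. The following is only a minimal sketch under that assumption; the class and method names (HostExtractor, TryGetHost) are made up for illustration and are not part of the framework.

```csharp
using System;

public static class HostExtractor
{
    // Hypothetical helper: extracts the host from a URL-like string.
    // Returns false (and an empty host) when the input cannot be parsed.
    public static bool TryGetHost(string input, out string host)
    {
        host = string.Empty;
        if (string.IsNullOrWhiteSpace(input))
        {
            return false;
        }

        try
        {
            // UriBuilder assumes "http://" when the string has no scheme.
            host = new UriBuilder(input.Trim()).Host;
            return host.Length > 0;
        }
        catch (UriFormatException)
        {
            // Not something UriBuilder can interpret as a URL.
            return false;
        }
    }
}
```

Calling HostExtractor.TryGetHost("www.google.co.uk", out var host) should leave host set to www.google.co.uk, the same value the plain UriBuilder calls above produce.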
Removing www. is hard
The problem seems easy, but it's not. You can't reliably remove subdomains like www because there's no real distinction. The domain is www.google.co.uk, including www. There's nothing special about co.uk that makes google part of the domain and www not part of it; it just happens that co.uk is managed by the registry, and google.co.uk is managed by Google.
To give you an idea of the problem, here's an incomplete list of domain suffixes, which includes nearly 7100 entries so far. Notably, where the split between suffix and domain falls isn't even consistent:
URL                      the domain you want
---------------------    -------------------
http://www.crews.aero    crews.aero
http://www.crew.aero     www.crew.aero
The best approach would be what Google itself does for Chrome's omnibox: fetch the (incomplete) list of domain suffixes, cache it temporarily, and compare domain names against it. You can see the result for yourself: type "crews.aero" into the omnibox and it will be treated as a URL; type "crew.aero" and it will be treated as a search.
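The comparison step could be sketched roughly as follows. This is an illustration, not Chrome's actual implementation: the SuffixMatcher class, its method name, and the five hard-coded suffixes are assumptions standing in for a fetched and cached list. It also assumes the real list resembles the Public Suffix List, whose wildcard and exception rules are ignored here.

```csharp
using System;
using System.Collections.Generic;

public static class SuffixMatcher
{
    // Tiny illustrative sample only; a real implementation would load
    // the full published suffix list and refresh its cache periodically.
    private static readonly HashSet<string> SampleSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "com", "uk", "co.uk", "aero", "crew.aero"
        };

    // Finds the longest known suffix and returns it plus one more label.
    // Returns false when the host is itself a suffix or nothing matches.
    public static bool TryGetRegistrableDomain(string host, out string domain)
    {
        domain = string.Empty;
        if (string.IsNullOrWhiteSpace(host))
        {
            return false;
        }

        string[] labels = host.Trim().TrimEnd('.').ToLowerInvariant().Split('.');

        // Start with the whole host so the longest suffix wins.
        for (int i = 0; i < labels.Length; i++)
        {
            string candidate = string.Join(".", labels, i, labels.Length - i);
            if (!SampleSuffixes.Contains(candidate))
            {
                continue;
            }

            if (i == 0)
            {
                // The host is a bare suffix such as "co.uk".
                return false;
            }

            // Keep exactly one label in front of the matched suffix.
            domain = string.Join(".", labels, i - 1, labels.Length - i + 1);
            return true;
        }

        return false;
    }
}
```

With this sample set, www.crews.aero reduces to crews.aero, while www.crew.aero is returned unchanged because crew.aero is itself listed as a suffix, matching the table above.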