
Extract domain name from URL in C#

This question has answers for other languages and platforms, but I couldn't find a robust solution in C#. I'm looking for the part of the URL that is used in a WHOIS lookup, so I'm not interested in subdomains, port, scheme, etc.

Example 1: http://s1.website.co.uk/folder/querystring?key=value => website.co.uk
Example 2: ftp://username:[email protected] => website.com

The result should be the same whenever the WHOIS owner is the same: sub1.xyz.com and sub2.xyz.com both belong to whoever owns xyz.com, and xyz.com is the part I need to extract from a URL.
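To illustrate where the difficulty lies: the scheme, credentials, port, path, and query string can be dropped with the framework's own System.Uri class, but the host it returns still contains any subdomains. A minimal sketch follows; the class name HostDemo and the sample URLs are hypothetical and chosen only for illustration.

```csharp
using System;

public static class HostDemo
{
    // Returns only the host portion of an absolute URL. System.Uri drops the
    // scheme, credentials, port, path and query string, but keeps subdomains.
    public static string GetHost(string url)
    {
        return new Uri(url).Host;
    }

    public static void Main()
    {
        // Hypothetical sample URLs, not taken from the question.
        Console.WriteLine(GetHost("http://s1.website.co.uk:8080/folder/page?key=value")); // s1.website.co.uk
        Console.WriteLine(GetHost("ftp://user:secret@website.com/file.txt"));             // website.com
    }
}
```

The first result is still s1.website.co.uk rather than website.co.uk, which is why a list of known suffixes is needed on top of plain URL parsing.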

Xaqron asked Nov 08 '10 02:11


2 Answers

I needed the same thing, so I wrote a class that you can copy and paste into your solution. It uses a hard-coded string array of TLDs: http://pastebin.com/raw.php?i=VY3DCNhp

Console.WriteLine(GetDomain.GetDomainFromUrl("http://www.beta.microsoft.com/path/page.htm"));

outputs microsoft.com

and

Console.WriteLine(GetDomain.GetDomainFromUrl("http://www.beta.microsoft.co.uk/path/page.htm"));

outputs microsoft.co.uk

servermanfail answered Sep 28 '22 17:09


As @Pete noted, this is a little bit complicated, but I'll give it a try.

Note that this approach only works if the application contains a complete list of known TLDs (more precisely, public suffixes such as co.uk). These can be retrieved from http://publicsuffix.org/. Extracting the list from that site is left as an exercise for the reader.

using System;
using System.Collections.Generic;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        var testCases = new[]
        {
            "www.domain.com.ac",
            "www.domain.ac",
            "domain.com.ac",
            "domain.ac",
            "localdomain",
            "localdomain.local"
        };

        foreach (string testCase in testCases)
        {
            Console.WriteLine("{0} => {1}", testCase, UriHelper.GetDomainFromUri(new Uri("http://" + testCase + "/")));
        }

        /* Produces the following results:

            www.domain.com.ac => domain.com.ac
            www.domain.ac => domain.ac
            domain.com.ac => domain.com.ac
            domain.ac => domain.ac
            localdomain => localdomain
            localdomain.local => localdomain.local
         */
    }
}

public static class UriHelper
{
    private static HashSet<string> _tlds;

    static UriHelper()
    {
        _tlds = new HashSet<string>
        {
            "com.ac",
            "edu.ac",
            "gov.ac",
            "net.ac",
            "mil.ac",
            "org.ac",
            "ac"

            // Complete this list from http://publicsuffix.org/.
        };
    }

    public static string GetDomainFromUri(Uri uri)
    {
        return GetDomainFromHostName(uri.Host);
    }

    public static string GetDomainFromHostName(string hostName)
    {
        string[] hostNameParts = hostName.Split('.');

        if (hostNameParts.Length == 1)
            return hostNameParts[0];

        int matchingParts = FindMatchingParts(hostNameParts, 1);

        return GetPartOfHostName(hostNameParts, hostNameParts.Length - matchingParts);
    }

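    // Walks the host name from left to right and returns how many trailing
    // labels make up the domain: the first known suffix found plus one label.
    // If no suffix matches, the count of all labels is returned.
    private static int FindMatchingParts(string[] hostNameParts, int offset)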
    {
        if (offset == hostNameParts.Length)
            return hostNameParts.Length;

        string domain = GetPartOfHostName(hostNameParts, offset);

        if (_tlds.Contains(domain.ToLowerInvariant()))
            return (hostNameParts.Length - offset) + 1;

        return FindMatchingParts(hostNameParts, offset + 1);
    }

    private static string GetPartOfHostName(string[] hostNameParts, int offset)
    {
        // Join the labels from offset to the end with dots.
        return string.Join(".", hostNameParts, offset, hostNameParts.Length - offset);
    }
}
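
The suffix set above is deliberately incomplete. As a rough sketch of the "exercise for the reader", the list published at publicsuffix.org is a plain text file with one rule per line, where lines starting with // are comments. The loader below is an assumption-laden illustration, not part of the original answer: the class name SuffixListLoader is hypothetical, it reads from any TextReader (for example a locally saved copy of the list), and it simply skips wildcard (*.) and exception (!) rules, which a plain set lookup cannot express.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SuffixListLoader
{
    // Reads rules in the publicsuffix.org list format: one rule per line,
    // "//" starts a comment, blank lines are ignored.
    // Simplification (assumption): wildcard ("*.") and exception ("!") rules
    // are skipped, because a plain HashSet lookup cannot express them.
    public static HashSet<string> Load(TextReader reader)
    {
        var suffixes = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();

            if (line.Length == 0 || line.StartsWith("//", StringComparison.Ordinal))
                continue;

            // A rule ends at the first whitespace character.
            string rule = line.Split(new[] { ' ', '\t' }, 2)[0];

            if (rule.StartsWith("*.", StringComparison.Ordinal) ||
                rule.StartsWith("!", StringComparison.Ordinal))
                continue;

            suffixes.Add(rule);
        }

        return suffixes;
    }

    public static void Main()
    {
        // Inline sample for illustration; in practice the reader would wrap
        // a saved copy of the full list.
        string sample = "// ac : sample excerpt\nac\ncom.ac\n\n*.ck\n!www.ck\n";
        HashSet<string> suffixes = Load(new StringReader(sample));

        Console.WriteLine(suffixes.Count);
        Console.WriteLine(suffixes.Contains("com.ac"));
    }
}
```

The returned set could be assigned to _tlds in the static constructor of UriHelper. Handling the wildcard and exception rules properly would require extending FindMatchingParts, which this sketch does not attempt.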
Pieter van Ginkel answered Sep 28 '22 16:09