
Regex to extract Favicon url from a webpage

Please help me find the favicon URL in the sample HTML below using a regular expression. It should also check for the file extension ".ico". I am developing a personal bookmarking site and I want to save the favicons of the links I bookmark. I have already written the C# code to convert the icon to GIF and save it, but I have very limited knowledge of regex, so I am unable to select this tag because the closing differs between sites, for example "/>" and "/link>".

My programming language is C#.

<meta name="description" content="Create 360 degree rotation product presentation online with 3Dbin. 360 product pics, object rotationg presentation can be created for your website at 3DBin.com web service." />
<meta name="robots" content="index, follow" />
<meta name="verify-v1" content="x42ckCSDiernwyVbSdBDlxN0x9AgHmZz312zpWWtMf4=" />
<link rel="shortcut icon" href="http://3dbin.com/favicon.ico" type="image/x-icon" />
<link rel="stylesheet" type="text/css" href="http://3dbin.com/css/1261391049/style.min.css" />
<!--[if lt IE 8]>
    <script src="http://3dbin.com/js/1261039165/IE8.js" type="text/javascript"></script>
<![endif]-->

Solution: one more way to do this is to download the HtmlAgilityPack DLL and add a reference to it. Thanks for helping me. I really love this site :)

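    // Requires: using System; using HtmlAgilityPack;
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(readcontent);

    // SelectNodes returns null when nothing matches, so guard before iterating.
    HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//link[@href]");
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            HtmlAttribute att = link.Attributes["href"];
            // Compare case-insensitively so ".ICO" is accepted as well.
            if (att.Value.EndsWith(".ico", StringComparison.OrdinalIgnoreCase))
            {
                faviconurl = att.Value;
            }
        }
    }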
ziaasp asked Oct 10 '22 15:10



2 Answers

This should match the whole link tag that contains href="http://3dbin.com/favicon.ico":

 <link .*? href="http://3dbin\.com/favicon\.ico" [^>]* />

Correction based on your comment:

I see you already have a C# solution. Excellent! But in case you are still wondering whether it can be done with regular expressions, the following expression does what you want. Group 1 of the match contains only the URL.

 <link\s[^>]*?href="([^"]*\.ico)"

A simple C# snippet that uses it:

// Requires: using System; using System.Text.RegularExpressions;
// This is the snippet from your example with an extra link item of the form <link ... href="...ico" > ... </link>,
// just to make sure it is picked up properly.
String htmlText = "<meta name=\"description\" content=\"Create 360 degree rotation product presentation online with 3Dbin. 360 product pics, object rotationg presentation can be created for your website at 3DBin.com web service.\" /><meta name=\"robots\" content=\"index, follow\" /><meta name=\"verify-v1\" content=\"x42ckCSDiernwyVbSdBDlxN0x9AgHmZz312zpWWtMf4=\" /><link rel=\"shortcut icon\" href=\"http://3dbin.com/favicon.ico\" type=\"image/x-icon\" /><link rel=\"shortcut icon\" href=\"http://anotherURL/someicofile.ico\" type=\"image/x-icon\">just to make sure it works with different link ending</link><link rel=\"stylesheet\" type=\"text/css\" href=\"http://3dbin.com/css/1261391049/style.min.css\" /><!--[if lt IE 8]>    <script src=\"http://3dbin.com/js/1261039165/IE8.js\" type=\"text/javascript\"></script><![endif]-->";

// [^>]* keeps the match inside one tag, [^"]* keeps it inside one attribute value, and \. matches a literal dot.
foreach (Match match in Regex.Matches(htmlText, "<link\\s[^>]*?href=\"([^\"]*\\.ico)\""))
{
    String url = match.Groups[1].Value;

    Console.WriteLine(url);
}

which prints the following to the console:

http://3dbin.com/favicon.ico
http://anotherURL/someicofile.ico
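
If the extraction is needed in more than one place, the regex approach can be wrapped in a small helper. The following is only a minimal sketch, not part of the original answer: the class name `FaviconRegex` and the method name `ExtractIcoUrls` are hypothetical. It assumes the href value is double-quoted, and it adds `RegexOptions.IgnoreCase` so that `HREF` or `.ICO` also match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FaviconRegex
{
    // Hypothetical helper, shown for illustration only.
    // [^>]*? keeps the match inside a single <link ...> tag,
    // [^"]*  keeps the capture inside a single double-quoted attribute value,
    // \.ico  requires a literal ".ico" at the end of the href.
    private static readonly Regex IcoLink = new Regex(
        "<link\\s[^>]*?href\\s*=\\s*\"([^\"]*\\.ico)\"",
        RegexOptions.IgnoreCase);

    // Returns every href ending in ".ico" found in a <link> tag, in document order.
    public static List<string> ExtractIcoUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match match in IcoLink.Matches(html))
        {
            urls.Add(match.Groups[1].Value);
        }
        return urls;
    }
}
```

Calling `FaviconRegex.ExtractIcoUrls(htmlText)` on the sample string returns the same two .ico URLs. Single-quoted or unquoted href values are not handled by this sketch.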
Rob answered Oct 20 '22 06:10



<link\s+[^>]*?(?:href\s*=\s*"([^"]+)"\s+)?rel\s*=\s*"shortcut icon"(?:\s+href\s*=\s*"([^"]+)")?

This might work, although it is not robust. (I used Perl regex syntax.)
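
Because the question is about C#, here is a minimal sketch of how a pattern of this shape could be used there. With this pattern the URL lands in group 1 when href comes directly before rel, and in group 2 when it comes directly after, so the caller has to check both. The class name `ShortcutIconRegex` and the method name `FindShortcutIcon` are hypothetical. The sketch uses a lazy `[^>]*?` at the start so that an href placed before rel is actually captured, and it adds `RegexOptions.IgnoreCase` as an assumption about real-world markup.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShortcutIconRegex
{
    // Hypothetical helper, shown for illustration only.
    // Group 1: href immediately before rel="shortcut icon".
    // Group 2: href immediately after rel="shortcut icon".
    private static readonly Regex ShortcutIcon = new Regex(
        @"<link\s+[^>]*?(?:href\s*=\s*""([^""]+)""\s+)?rel\s*=\s*""shortcut icon""(?:\s+href\s*=\s*""([^""]+)"")?",
        RegexOptions.IgnoreCase);

    // Returns the href of the first shortcut-icon link, or null if none was captured.
    public static string FindShortcutIcon(string html)
    {
        Match match = ShortcutIcon.Match(html);
        if (!match.Success) return null;
        if (match.Groups[1].Success) return match.Groups[1].Value;
        if (match.Groups[2].Success) return match.Groups[2].Value;
        return null; // rel matched, but href was not adjacent to it
    }
}
```

As the answer says, this is not robust: the sketch returns null when another attribute sits between href and rel, and it ignores links that use `rel="icon"` without "shortcut".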

ShinTakezou answered Oct 20 '22 07:10
