
How to extract full url with HtmlAgilityPack - C#

The code below extracts only the relative URL, exactly as it is written in the href attribute.

The extraction code:

foreach (HtmlNode link in hdDoc.DocumentNode.SelectNodes("//a[@href]"))
{
    // Value is already a string, so no ToString() call is needed
    lsLinks.Add(link.Attributes["href"].Value);
}

The link markup:

<a href="Login.aspx">Login</a>

The extracted URL:

Login.aspx

But I want the full link as the browser would resolve it, like this:

http://www.monstermmorpg.com/Login.aspx

I could check whether the URL contains "http" and prepend the domain if it does not, but that may cause problems in some cases, and I don't think it is a very wise solution.

C# 4.0, HtmlAgilityPack 1.4.0

MonsterMMORPG asked Oct 13 '11 20:10



1 Answer

Assuming you have the URL of the page you crawled, you can combine it with the parsed href value like this:

// The address of the page you crawled
var baseUrl = new Uri("http://example.com/path/to-page/here.aspx");

// root relative
var url = new Uri(baseUrl, "/Login.aspx");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/Login.aspx'

// relative
url = new Uri(baseUrl, "../foo.aspx?q=1");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/path/foo.aspx?q=1'

// absolute
url = new Uri(baseUrl, "http://stackoverflow.com/questions/7760286/");
Console.WriteLine (url.AbsoluteUri); // prints 'http://stackoverflow.com/questions/7760286/'

// other...
url = new Uri(baseUrl, "javascript:void(0)");
Console.WriteLine (url.AbsoluteUri); // prints 'javascript:void(0)'

Note the use of AbsoluteUri rather than ToString(): ToString() unescapes the URL (to make it more "human-readable"), which is typically not what you want.
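
To tie this back to the loop in the question, here is a minimal sketch of resolving a whole list of href values at once. LinkResolver and ResolveAll are hypothetical names made up for illustration (they are not part of HtmlAgilityPack), and it assumes you know the address of the page the links were scraped from. It uses Uri.TryCreate rather than the constructor so that a malformed href is skipped instead of throwing.

```csharp
using System;
using System.Collections.Generic;

public static class LinkResolver
{
    // Hypothetical helper: resolves raw href values against the address
    // of the page they were found on, using the same Uri combining rules
    // as the examples above.
    public static List<string> ResolveAll(Uri pageUrl, IEnumerable<string> hrefs)
    {
        var resolved = new List<string>();
        foreach (var href in hrefs)
        {
            Uri absolute;
            // TryCreate returns false for an href that cannot be combined
            // with the base address; such entries are skipped here.
            if (Uri.TryCreate(pageUrl, href, out absolute))
            {
                resolved.Add(absolute.AbsoluteUri);
            }
        }
        return resolved;
    }
}
```

In the question's loop you would collect link.Attributes["href"].Value into a list and pass it in together with the page address, e.g. new Uri("http://www.monstermmorpg.com/"). Non-HTTP links such as javascript:void(0) are passed through unchanged, as in the last example above, so you may want to filter on the Scheme property if you only need crawlable URLs.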

Duncan Smart answered Sep 21 '22 22:09
