Alright with the way below it is extracting only referring url like this
the extraction code :
foreach (HtmlNode link in hdDoc.DocumentNode.SelectNodes("//a[@href]"))
{
lsLinks.Add(link.Attributes["href"].Value.ToString());
}
The url code
<a href="Login.aspx">Login</a>
The extracted url
Login.aspx
But i want to get real link what browser parsed like
http://www.monstermmorpg.com/Login.aspx
I can do it with checking the url whether containing http and if not add the domain value but it may cause some problems at some occasions and i think not a very wise solution.
c# 4.0 , HtmlAgilityPack.1.4.0
Assuming you have the original url, you can combine the parsed url something like this:
// The address of the page you crawled
var baseUrl = new Uri("http://example.com/path/to-page/here.aspx");
// root relative
var url = new Uri(baseUrl, "/Login.aspx");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/Logon.aspx'
// relative
url = new Uri(baseUrl, "../foo.aspx?q=1");
Console.WriteLine (url.AbsoluteUri); // prints 'http://example.com/path/foo.aspx?q=1'
// absolute
url = new Uri(baseUrl, "http://stackoverflow.com/questions/7760286/");
Console.WriteLine (url.AbsoluteUri); // prints 'http://stackoverflow.com/questions/7760286/'
// other...
url = new Uri(baseUrl, "javascript:void(0)");
Console.WriteLine (url.AbsoluteUri); // prints 'javascript:void(0)'
Note the use of AbsoluteUri
and not relying on ToString()
because ToString
decodes the URL (to make it more "human-readable"), which is not typically what you want.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With