
How to check whether a page is a 404 error page (page does not exist) using HtmlAgilityPack

I am trying to read URLs and collect the images on each page. I need to skip any page that returns a 404 so that no images are collected from the error page. How can I do this with HtmlAgilityPack? Here is my code:

var document = new HtmlWeb().Load(completeurl);
var urls = document.DocumentNode.Descendants("img")
          .Select(e => e.GetAttributeValue("src", null))
          .Where(s => !String.IsNullOrEmpty(s)).ToList();
bala3569 asked Jan 09 '16 14:01


2 Answers

You'll need to register a PostRequestHandler event on the HtmlWeb instance. It is raised after each downloaded document and gives you access to the HttpWebResponse object, which has a StatusCode property.

HtmlWeb web = new HtmlWeb();
HttpStatusCode statusCode = HttpStatusCode.OK;

// Record the status code of each response as it arrives.
web.PostRequestHandler += (request, response) =>
{
    if (response != null)
    {
        statusCode = response.StatusCode;
    }
};

var doc = web.Load(completeurl);
if (statusCode == HttpStatusCode.OK)
{
    // received a real document
}

Looking at the HtmlAgilityPack code on GitHub, it's even simpler: HtmlWeb has a StatusCode property that is set to the status code of the response:

var web = new HtmlWeb();
var document = web.Load(completeurl);

if (web.StatusCode == HttpStatusCode.OK)
{
    var urls = document.DocumentNode.Descendants("img")
          .Select(e => e.GetAttributeValue("src", null))
          .Where(s => !String.IsNullOrEmpty(s)).ToList();
}
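As a rough sketch, the check above could be wrapped in a small helper that returns an empty list for any page that does not come back with 200 OK. The names ImageScraper, ExtractImageUrls and GetImageUrls are hypothetical and not part of HtmlAgilityPack; the sketch also assumes that HtmlWeb.Load returns normally on a 404 and only records the status, as this answer implies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper class; only HtmlWeb, HtmlDocument and StatusCode
// come from HtmlAgilityPack.
public static class ImageScraper
{
    // Collects the non-empty src attribute of every <img> element.
    public static List<string> ExtractImageUrls(HtmlDocument document)
    {
        return document.DocumentNode.Descendants("img")
            .Select(e => e.GetAttributeValue("src", null))
            .Where(s => !String.IsNullOrEmpty(s))
            .ToList();
    }

    // Returns the image URLs of a page, or an empty list when the server
    // did not answer with 200 OK (for example, a 404 error page).
    public static List<string> GetImageUrls(string url)
    {
        var web = new HtmlWeb();
        var document = web.Load(url);

        if (web.StatusCode != HttpStatusCode.OK)
        {
            return new List<string>();
        }

        return ExtractImageUrls(document);
    }
}
```

Calling ImageScraper.GetImageUrls(completeurl) would then replace the original snippet. Network failures such as an unreachable host may still throw, so a caller might want its own try/catch around the call.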

Update

There has been an update to the HtmlAgilityPack API. The trick is still the same:

var htmlWeb = new HtmlWeb();
var lastStatusCode = HttpStatusCode.OK;

htmlWeb.PostResponse = (request, response) =>
{
    if (response != null)
    {
        lastStatusCode = response.StatusCode;
    }
};
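The snippet above only registers the callback. A minimal sketch of the full flow, including the load and a check aimed specifically at 404, might look like this. PageChecker, IsNotFound and LoadUnlessNotFound are hypothetical names, and the sketch assumes that the PostResponse callback has run by the time Load returns.

```csharp
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper class; only HtmlWeb, HtmlDocument and PostResponse
// come from HtmlAgilityPack.
public static class PageChecker
{
    // True only for a 404 Not Found status.
    public static bool IsNotFound(HttpStatusCode statusCode)
    {
        return statusCode == HttpStatusCode.NotFound;
    }

    // Loads the page and returns null when the server answered with 404.
    public static HtmlDocument LoadUnlessNotFound(string url)
    {
        var htmlWeb = new HtmlWeb();
        var lastStatusCode = HttpStatusCode.OK;

        // Capture the status code of the response.
        htmlWeb.PostResponse = (request, response) =>
        {
            if (response != null)
            {
                lastStatusCode = response.StatusCode;
            }
        };

        var document = htmlWeb.Load(url);
        return IsNotFound(lastStatusCode) ? null : document;
    }
}
```

Checking for NotFound only skips 404 pages; comparing against HttpStatusCode.OK instead, as in the earlier snippets, would also skip other error responses.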
jessehouwing answered Nov 03 '22 18:11


Be aware of the version you use!

I am using HtmlAgilityPack v1.5.1, and it has no PostRequestHandler event.

In v1.5.1, you have to use the PostResponse field instead. See the example below.

var htmlWeb = new HtmlWeb();
var lastStatusCode = HttpStatusCode.OK;

htmlWeb.PostResponse = (request, response) =>
{
    if (response != null)
    {
        lastStatusCode = response.StatusCode;
    }
};

The differences are small, but they are there.

Hope this saves someone some time.

Roman Zinnatov answered Nov 03 '22 18:11