
HtmlAgilityPack.HtmlDocument Cookies

This pertains to cookies set inside a script (maybe inside a script tag).

System.Windows.Forms.HtmlDocument executes those scripts, and the cookies they set (for example via document.cookie = ...) can be retrieved through its Cookie property.

I assume HtmlAgilityPack.HtmlDocument doesn't execute scripts. Is there an easy way to emulate the cookie handling of System.Windows.Forms.HtmlDocument?

Anyone?

Jojo asked Apr 06 '11 07:04


2 Answers

When I need to use cookies together with HtmlAgilityPack, or just create custom requests (for example, to set the User-Agent property), here is what I do:

  • Create a class that encapsulates the request/response. Let's call this class WebQuery.
  • Have a private CookieContainer property (in your case public) inside that class.
  • Create a method inside the class that performs the request manually. The signature could be:

...

public HtmlAgilityPack.HtmlDocument GetSource(string url);

What do we need to do inside this method?

Well, using HttpWebRequest and HttpWebResponse, build the HTTP request manually (there are several examples of how to do this on the Internet), then create an HtmlDocument instance and fill it with its Load method, using the overload that takes a stream.

What stream do we have to use? Well, the one returned by:

httpResponse.GetResponseStream();

If you use HttpWebRequest to make the query, you can set its CookieContainer property to the variable you declared earlier every time you access a new page. That way, all cookies set by the sites you access are stored in the CookieContainer of your WebQuery class, provided you use a single WebQuery instance throughout.
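The steps above could be sketched roughly as follows. This is only a minimal illustration, not the answerer's actual code: the class name WebQuery and the GetSource signature come from the answer, while the Cookies and UserAgent members and the default User-Agent string are assumptions made for the example. It needs the HtmlAgilityPack package to be referenced.

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper; only the name WebQuery and the GetSource signature
// are taken from the answer, the rest is an illustrative assumption.
public class WebQuery
{
    // One container for the lifetime of this instance, shared by every request.
    public CookieContainer Cookies { get; } = new CookieContainer();

    // Assumed default value; set whatever User-Agent the target site expects.
    public string UserAgent { get; set; } = "Mozilla/5.0 (compatible; WebQuery sketch)";

    public HtmlDocument GetSource(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = Cookies;   // sends stored cookies, collects new ones
        request.UserAgent = UserAgent;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        {
            var doc = new HtmlDocument();
            doc.Load(stream);                // parses the markup; scripts are not executed
            return doc;
        }
    }
}
```

After a call such as query.GetSource(url), the cookies received for a site can be read with query.Cookies.GetCookies(new Uri(url)). Note that only cookies delivered in Set-Cookie response headers end up in the container. Cookies assigned by a script through document.cookie do not, because nothing in this pipeline runs the script.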

I hope you find this explanation useful. Keep in mind that with this approach you can do whatever you want, whether HtmlAgilityPack supports it or not.

Oscar Mederos answered Sep 21 '22 19:09


I also used Rohit Agarwal's BrowserSession class together with HtmlAgilityPack. For me, though, subsequent calls to its Get function didn't work, because new cookies were set on every call. That's why I added some functions of my own. (My solution is far from perfect; it's just a quick and dirty fix.) It worked for me, and if you don't want to spend a lot of time investigating the BrowserSession class, here is what I did:

The added/modified functions are the following:

class BrowserSession
{
    private bool _isPost;
    private HtmlDocument _htmlDoc;
    // <- This is the new CookieContainer. It is created up front so that
    //    SaveCookiesFrom can add to it without hitting a null reference.
    public CookieContainer cookiePot = new CookieContainer();

    ...

    // Loads a page with the shared cookiePot attached to the request.
    public string Get2(string url)
    {
        HtmlWeb web = new HtmlWeb();
        web.UseCookies = true;
        web.PreRequest = new HtmlWeb.PreRequestHandler(OnPreRequest2);
        web.PostResponse = new HtmlWeb.PostResponseHandler(OnAfterResponse2);
        HtmlDocument doc = web.Load(url);
        return doc.DocumentNode.InnerHtml;
    }

    // Called by HtmlWeb just before the request is sent.
    public bool OnPreRequest2(HttpWebRequest request)
    {
        request.CookieContainer = cookiePot;
        return true;    // true lets the request proceed
    }

    protected void OnAfterResponse2(HttpWebRequest request, HttpWebResponse response)
    {
        // do nothing
    }

    // Modified version of the original method: it also fills the cookiePot.
    private void SaveCookiesFrom(HttpWebResponse response)
    {
        if (response.Cookies.Count > 0)
        {
            if (Cookies == null)
            {
                Cookies = new CookieCollection();
            }
            Cookies.Add(response.Cookies);
            cookiePot.Add(Cookies);     // -> add the Cookies to the cookiePot
        }
    }
}

What it does: it saves the cookies from the initial POST response and attaches the same CookieContainer to every later request. I do not fully understand why the initial version was not working, because its AddCookiesTo function does much the same thing (if (Cookies != null && Cookies.Count > 0) request.CookieContainer.Add(Cookies);). Anyhow, with these added functions it should work fine.
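To illustrate why a shared container carries the session along, here is a small sketch that needs no network access. It is not part of BrowserSession: the class CookiePotDemo, the helper HeaderFor, the host names and the cookie values are all made up for the example. It only uses CookieContainer.Add and CookieContainer.GetCookieHeader from System.Net.

```csharp
using System;
using System.Net;

// Hypothetical demo class, not part of BrowserSession.
public static class CookiePotDemo
{
    // Returns the Cookie header the container would attach to a request for pageUrl.
    public static string HeaderFor(CookieContainer pot, string pageUrl)
    {
        return pot.GetCookieHeader(new Uri(pageUrl));
    }

    public static void Main()
    {
        var cookiePot = new CookieContainer();

        // Stand-in for the cookies of a login response (made-up name and value).
        var fromLogin = new CookieCollection();
        fromLogin.Add(new Cookie("PHPSESSID", "abc123", "/", "www.example.com"));
        cookiePot.Add(fromLogin);

        // Same host as the cookie's domain: the stored cookie is offered.
        Console.WriteLine(HeaderFor(cookiePot, "http://www.example.com/secondpage"));

        // Different site: the container has nothing to offer for it.
        Console.WriteLine(HeaderFor(cookiePot, "http://other.example.org/"));
    }
}
```

Any request that gets this container assigned to its CookieContainer property before it is sent would carry the same header, which is what OnPreRequest2 relies on.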

It can be used like this:

//initial "Login-procedure"
BrowserSession b = new BrowserSession();
b.Get("http://www.blablubb/login.php");
b.FormElements["username"] = "yourusername";
b.FormElements["password"] = "yourpass";
string response = b.Post("http://www.blablubb/login.php");

All subsequent calls should use:

response = b.Get2("http://www.blablubb/secondpageyouwannabrowseto");
response = b.Get2("http://www.blablubb/thirdpageyouwannabrowseto");
...

I hope it helps when you're facing the same problem.

funkypopcorn answered Sep 22 '22 19:09