Access the DOM using WebBrowser

I need to access the DOM of an HTML document after JavaScript has run on the page. The code below navigates to the URL and reads the document. The problem is that the HTML it returns never reflects the changes made by JavaScript.

public class CustomBrowser
{
    public CustomBrowser()
    {
        //
        // TODO: Add constructor logic here
        //
    }

    protected string _url;
    string html = "";
    WebBrowser browser;

    public string GetWebpage(string url)
    {
        _url = url;
        // WebBrowser is an ActiveX control that must be run in a
        // single-threaded apartment, so create a thread to create the
        // control and load the page
        Thread thread = new Thread(new ThreadStart(GetWebPageWorker));
        thread.SetApartmentState(ApartmentState.STA);
        thread.Start();
        thread.Join();
        string s = html;
        return s;
    }

    protected void GetWebPageWorker()
    {
        browser = new WebBrowser();
        //  browser.ClientSize = new Size(_width, _height);
        browser.ScrollBarsEnabled = false;
        browser.ScriptErrorsSuppressed = true;
        //browser.DocumentCompleted += browser_DocumentCompleted;
        browser.Navigate(_url);

        // Wait for control to load page
        while (browser.ReadyState != WebBrowserReadyState.Complete)
            Application.DoEvents();

        Thread.Sleep(5000);


        var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)browser.Document.DomDocument;

        html = documentAsIHtmlDocument3.documentElement.outerHTML; 


        browser.Dispose();
    }


}

The DOM as shown in the Google Chrome developer tools

The DOM I get in my code

I hope that someone can help me with this problem.

Abubakr A.Hafiz asked Feb 27 '17 21:02


1 Answer

If the client-side script is indeed executing in IE7 as you say, the issue might be purely one of timing. Even after the document has finished loading, you cannot know exactly when its scripts will run. Waiting 5 seconds before reading documentElement sounds reasonable in theory, but in practice the element might exist well before that, or the network might be slow enough that fetching the jQuery script alone takes 5 seconds.

I suggest testing for the existence of the element you are looking for (an img tag, as the case may be). Something along the lines of

while (browser.Document.GetElementsByTagName("img").Count == 0) {
    Application.DoEvents();
}

This way, you wouldn't need the Thread.Sleep line.
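
One caveat with the loop above is that it never exits if no img element ever appears. A minimal sketch of the same polling idea with a timeout might look like the following. BrowserWait and WaitForElement are hypothetical names introduced here for illustration, not part of the WebBrowser API, and the sketch assumes it is called on the same STA thread that created the control.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Windows.Forms;

// Hypothetical helper class, not part of Windows Forms.
public static class BrowserWait
{
    // Pumps the message loop until at least one element with the given
    // tag name exists in the document, or until the timeout elapses.
    // Returns true if an element was found, false on timeout.
    public static bool WaitForElement(WebBrowser browser, string tagName, TimeSpan timeout)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();

        while (stopwatch.Elapsed < timeout)
        {
            // Document can be null before the first navigation has produced one.
            HtmlDocument document = browser.Document;
            if (document != null && document.GetElementsByTagName(tagName).Count > 0)
                return true;

            // Let the control process pending messages so the page and
            // its scripts can make progress.
            Application.DoEvents();

            // Brief pause so the loop does not spin at full CPU.
            Thread.Sleep(10);
        }

        return false;
    }
}
```

In GetWebPageWorker, a call such as BrowserWait.WaitForElement(browser, "img", TimeSpan.FromSeconds(30)) could stand in for the Thread.Sleep(5000) line, with outerHTML read only when it returns true. The 30-second limit is an arbitrary value chosen for the example.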

MrMister answered Oct 13 '22 13:10