
C# screen scraping an ASP.NET web forms page - POST request not completely working

Tags: c#, asp.net

Please bear with me for this slightly long-winded description, but I'm having a strange problem screen scraping an ASP.NET web forms page from C#. The steps I'm trying to perform are as follows:

1) The site is secured using basic authentication over HTTPS, so I need to log in appropriately.

2) I'm performing a GET request on the page to retrieve the __VIEWSTATE value (darn thing does nothing if I don't set it!)

3) Once logged in, there are several form fields to complete, then a submit button which POSTs the form to the server.

4) When the submit button is pressed, the form is POSTed to the server and the response is the same page and form, but now with an extra little HTML table at the bottom of the form containing some data I need to get at.

I've so far managed to sort out the login and form post using the WebClient class. I've used Fiddler (and Firebug) to check the POST field values that are sent when completing the form normally in a browser. I can successfully get a response from the POST request, with the data table in question appearing below the form as expected.

The problem, however, is that although the table is populated with data, it is not the data I expect. The data that appears is what I would get if I completed the form in a browser as normal, but with one particular parameter (a drop-down list) set to a different value from the one I'm passing in my POST request. I've confirmed using Fiddler and Firebug that I'm passing exactly the same POST parameters that a browser sends when a human completes the form. I'm now totally stuck as to why this one parameter is not being 'taken into consideration' by the server.

The one difference is that this particular control is a select list that performs a page reload or 'postback' when changed. However, this doesn't seem to do anything apart from change the content of some other select lists later in the form.

I guess I'm asking: is there anything else I'm missing that would cause this? I'm totally tearing my hair out on this one. Can anyone help? I've posted the code below (with addresses and parameters blanked out for privacy).

    // a place to store the html
    string responseBody = "";

    // create our web client to handle the request
    using (WebClient webClient = new WebClient())
    {
        // space to store responses from the remote site
        byte[] responseBytes;

        // site uses basic authentication over HTTPS so we'll need to login
        CredentialCache credentials = new CredentialCache();
        credentials.Add(new Uri(Url), "Basic", new NetworkCredential(Username, Password));

        // set the credentials in the web client
        webClient.Credentials = credentials;

        // a place for __VIEWSTATE
        string viewState = "";

        // try and get __VIEWSTATE from the web site
        try
        {
            responseBytes = webClient.DownloadData(Url);
            viewState = GetHtmlInputValue(Encoding.UTF8.GetString(responseBytes), "__VIEWSTATE");
        }
        catch (Exception e)
        {
            bool cancel = false;
            ComponentMetaData.FireError(10, "Read web page data", "Error whilst trying to get __VIEWSTATE from web page: " + e.Message, "", 0, out cancel);
        }

        // add our POST parameters (don't forget the __VIEWSTATE or it won't work, as it's an ASP.NET web page)
        NameValueCollection requestParameters = new NameValueCollection();

        // add ASP.NET fields
        requestParameters.Add("__EVENTTARGET", __EVENTTARGET);
        requestParameters.Add("__EVENTARGUMENT", __EVENTARGUMENT);
        requestParameters.Add("__LASTFOCUS", __LASTFOCUS);

        // add __VIEWSTATE
        requestParameters.Add("__VIEWSTATE", viewState);

        // all other form parameters
        requestParameters.Add("btnSubmit", btnSubmit);
        /* I've hidden the rest of the parameters for privacy just in case */

        // see if we can connect and get data
        try
        {
            // set content type
            webClient.Headers.Clear();
            webClient.Headers.Add("Content-Type", "application/x-www-form-urlencoded");                             

            // 'POST' the form data using web client and hope we get a response
            responseBytes = webClient.UploadValues(Url, "POST", requestParameters);

            // transform the response to a string
            responseBody = Encoding.UTF8.GetString(responseBytes);
        }
        catch (Exception e)
        {
            bool cancel = false;
            ComponentMetaData.FireError(10, "Read web page data", "Error whilst trying to connect to web page: " + e.Message, "", 0, out cancel);
        }
    }

Please ignore the 'ComponentMetaData' references as this is part of SSIS script source.

Any ideas or help will be greatly appreciated - cheers!

RE: thanks for the quick responses, all I can say to those comments is...

There's the normal ASP session cookie, but there are no values in the cookie (apart from the session ID, of course). I figured that as the site is using basic authentication, not forms authentication, I could just ignore the cookie, and as I'm getting into the site and getting data returned this seemed OK. I guess it's worth a try, but I'll have to alter the code to use the WebRequest class instead...
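If the session cookie does turn out to matter, one option that avoids rewriting everything around WebRequest is to subclass WebClient so it keeps cookies between the GET and the POST. This is only a minimal sketch: the class name `CookieAwareWebClient` is made up for illustration, and whether the server actually depends on the session cookie is an assumption that has not been confirmed here.

```csharp
using System;
using System.Net;

// Hypothetical helper (not from the original code): a WebClient that
// sends back whatever cookies the server set on earlier requests.
public class CookieAwareWebClient : WebClient
{
    // One container shared by every request made through this instance.
    private readonly CookieContainer cookies = new CookieContainer();

    public CookieContainer Cookies
    {
        get { return cookies; }
    }

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        // Only HTTP(S) requests carry cookies; leave other schemes untouched.
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null)
        {
            httpRequest.CookieContainer = cookies;
        }

        return request;
    }
}
```

Swapping `new WebClient()` for `new CookieAwareWebClient()` in the `using` statement would leave the rest of the code unchanged, since credentials, `DownloadData` and `UploadValues` are inherited.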

As for the select list JavaScript: no, there's no JavaScript changing the value of the select list after page load. The only JavaScript on the select list is an onchange event to do a 'postback', which only seems to change some other select lists on the form that are empty anyway in the final POST. Note that I'm including all the POST parameters when generating the POST request, even if they're empty, and I'm also including all the 'web forms' special fields such as __VIEWSTATE, __EVENTTARGET etc...

I'm no expert in web forms (MVC man myself), but is there anything else that the web forms 'engine' is expecting? I've sent one header, the 'Content-Type' of 'application/x-www-form-urlencoded'. I've also tried setting others, such as copying the 'User-Agent' header from the original POST, but this ends up with me getting a 500 error from the server. Not sure why that would happen?

Here's the code for 'GetHtmlInputValue'. It's a bit simple/basic and could be done better, but:

    private string GetHtmlInputValue(string html, string inputID)
    {
        string valueDelimiter = "value=\"";

        int namePosition = html.IndexOf(inputID);
        int valuePosition = html.IndexOf(valueDelimiter, namePosition);

        int startPosition = valuePosition + valueDelimiter.Length;
        int endPosition = html.IndexOf("\"", startPosition);

        return html.Substring(startPosition, endPosition - startPosition);
    }

padigan asked Jul 21 '15 15:07


1 Answer

If I understand you correctly, then selecting an item in the dropdown will cause a POST to be performed, and the server alters the available options in another part of the form. The server will then include the current value of the dropdown in the __VIEWSTATE field value.
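The sequence described above could be replayed by the scraper roughly as follows: GET the page, POST once with `__EVENTTARGET` set to the dropdown so the server processes the change, then POST again with the submit button and the hidden fields taken from that second response. This is a sketch under assumptions, not a tested fix for this site: `PostbackReplay`, `GetHiddenField`, `BuildForm` and `Submit` are hypothetical names, the regular expression assumes `name` is rendered before `value` in the hidden inputs, and `__EVENTVALIDATION` is only copied when the page happens to emit it.

```csharp
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question's code.
public static class PostbackReplay
{
    // Returns the value of a hidden input by name, or "" when it is absent.
    // Assumes name comes before value inside the tag.
    public static string GetHiddenField(string html, string name)
    {
        string pattern = "<input[^>]*\\bname=\"" + Regex.Escape(name)
                       + "\"[^>]*\\bvalue=\"([^\"]*)\"";
        Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return match.Success ? WebUtility.HtmlDecode(match.Groups[1].Value) : "";
    }

    // Builds one POST body: the visible fields plus the hidden fields found in 'html'.
    private static NameValueCollection BuildForm(string html, NameValueCollection fields,
                                                 string eventTarget)
    {
        NameValueCollection form = new NameValueCollection(fields);
        form.Set("__EVENTTARGET", eventTarget);
        form.Set("__EVENTARGUMENT", "");
        form.Set("__VIEWSTATE", GetHiddenField(html, "__VIEWSTATE"));

        // Only copied when the page actually contains the field.
        string validation = GetHiddenField(html, "__EVENTVALIDATION");
        if (validation.Length > 0)
        {
            form.Set("__EVENTVALIDATION", validation);
        }
        return form;
    }

    // 'fields' holds the visible form fields (including the dropdown's desired
    // value) but not the submit button; 'dropDownName' is the dropdown's name
    // attribute as seen in the browser's POST.
    public static string Submit(WebClient client, string url, NameValueCollection fields,
                                string dropDownName, string submitName, string submitValue)
    {
        // 1) GET the page to obtain the initial hidden fields.
        string html = Encoding.UTF8.GetString(client.DownloadData(url));

        // 2) Replay the dropdown's onchange postback (no submit button in this POST).
        NameValueCollection postback = BuildForm(html, fields, dropDownName);
        html = Encoding.UTF8.GetString(client.UploadValues(url, "POST", postback));

        // 3) Final submit, using the hidden fields from the postback response.
        NameValueCollection submit = BuildForm(html, fields, "");
        submit.Set(submitName, submitValue);
        return Encoding.UTF8.GetString(client.UploadValues(url, "POST", submit));
    }
}
```

If the dependent select lists need values in the final POST, they would have to be filled in from the step 2 response before step 3; the sketch leaves `fields` unchanged between the two POSTs.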

When you perform the scraping, you should make sure that the __VIEWSTATE contains the desired value for the dropdown. To investigate further, try to decode the viewstate from the server and see which values are sent back.
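For a quick first look at what the viewstate contains, it can be Base64-decoded and scanned for readable text. The sketch below is a crude inspection aid rather than a real deserializer: `ViewStatePeek` and `ReadableStrings` are hypothetical names, and it assumes the viewstate is not encrypted (if it is, nothing recognisable will show up).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical inspection helper; it does not parse the viewstate format,
// it only pulls out runs of printable ASCII from the decoded bytes.
public static class ViewStatePeek
{
    public static List<string> ReadableStrings(string viewState, int minLength)
    {
        byte[] raw = Convert.FromBase64String(viewState);
        List<string> found = new List<string>();
        StringBuilder run = new StringBuilder();

        foreach (byte b in raw)
        {
            if (b >= 0x20 && b < 0x7F)
            {
                // Printable ASCII: extend the current run.
                run.Append((char)b);
            }
            else
            {
                // Non-printable byte ends the run; keep it if long enough.
                if (run.Length >= minLength)
                {
                    found.Add(run.ToString());
                }
                run.Clear();
            }
        }

        // Flush a run that reaches the end of the data.
        if (run.Length >= minLength)
        {
            found.Add(run.ToString());
        }
        return found;
    }
}
```

Comparing the output for the `__VIEWSTATE` returned by the plain GET with the one a browser sends after changing the dropdown should show whether the selected value differs between the two.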

Martin Wiboe answered Sep 29 '22 07:09