I'm using Scrapy to scrape some gold that's behind an authentication screen. The website uses ASP.NET, and ASP.NET's got some stupid hidden fields littered all over the form (like __VIEWSTATE and __EVENTTARGET).
When I call FormRequest.from_response(response, ...), I expect it to read these hidden fields automatically from the response and populate them in the formdata dictionary - which is what Scrapy's FormRequest documentation says it should do.
But if that's the case, why does the login only work when I explicitly list these fields and populate them myself?
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest


class ItsyBitsy(Spider):
    name = "itsybitsy"
    allowed_domains = ["website.com"]
    start_urls = ["http://website.com/cpanel/Default.aspx"]

    def parse(self, response):
        # Performs authentication to get past the login form
        sel = Selector(response)
        return [FormRequest.from_response(
            response,
            formdata={
                'tb_Username': 'admin',
                'tb_Password': 'password',
                # The following fields should be auto-populated, right?
                # So why does removing 'em break the login (w/ 500 Server Error)?
                '__VIEWSTATE':
                    sel.xpath("//input[@name='__VIEWSTATE']/@value").extract(),
                '__EVENTVALIDATION':
                    sel.xpath("//input[@name='__EVENTVALIDATION']/@value").extract(),
                '__EVENTTARGET': 'b_Login',
            },
            callback=self.after_login,
            clickdata={'id': 'b_Login'},
            dont_click=True,
        )]

    def after_login(self, response):
        # Mmm, scrumptious
        pass
Here is the relevant login form HTML:
<form id="form1" action="Default.aspx" method="post" name="form1">
<div>
<input type="hidden" value="" id="__EVENTTARGET" name="__EVENTTARGET">
<input type="hidden" value="" id="__EVENTARGUMENT" name="__EVENTARGUMENT">
<input type="hidden" value="/wEPDwULLTE2OTg2NjA1NTAPZBYCAgMPZBYGAgMPD2QWAh4Kb25rZXlwcmVzcwUlcmV0dXJuIGNsaWNrQnV0dG9uKGV2ZW50LCAnYl9Mb2dpbicpO2QCBQ8PZBYCHwAFJXJldHVybiBjbGlja0J1dHRvbihldmVudCwgJ2JfTG9naW4nKTtkAgcPD2QWAh4Hb25jbGljawUPcmV0dXJuIGxvZ2luKCk7ZGRKt/WTOQThVTxB9Y0QcIuRqylCIw==" id="__VIEWSTATE" name="__VIEWSTATE">
</div>
<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['form1'];
if (!theForm) {
theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
theForm.__EVENTTARGET.value = eventTarget;
theForm.__EVENTARGUMENT.value = eventArgument;
theForm.submit();
}
}
//]]>
</script>
<div>
<input type="hidden" value="/wEWBAK0o8DDCQLxz5rcDwLF8dCIDALHyYWSA+rA4VJNaEpFIycMDHQPUOz393TI" id="__EVENTVALIDATION" name="__EVENTVALIDATION">
<input type="text" onkeypress="return clickButton(event, 'b_Login');" size="28" class="textfield-text" id="tb_Username" name="tb_Username">
<input type="password" onkeypress="return clickButton(event, 'b_Login');" size="28" class="textfield-text" id="tb_Password" name="tb_Password">
<a href="javascript:__doPostBack('b_Login','')" class="button-link" id="b_Login" onclick="return login();">Login</a>
</div>
</form>
According to the source code, Scrapy uses the following XPath expression to parse the inputs out of the form:
descendant::textarea|descendant::select|descendant::input[@type!="submit" and @type!="image" and @type!="reset" and ((@type!="checkbox" and @type!="radio") or @checked)]
In other words, all of your hidden inputs are successfully parsed (and later sent with the request) with values equal to their value attributes. So, Scrapy does what it should here.
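You can see this for yourself by building the request without any formdata overrides and inspecting the body Scrapy would post. A quick sketch, assuming it runs inside parse(); the logger call is just for illustration:

    # Hypothetical sanity check inside parse(): build the request with no
    # overrides and look at the urlencoded body Scrapy would send.
    req = FormRequest.from_response(response)
    self.logger.info(req.body)  # includes __VIEWSTATE=...&__EVENTVALIDATION=...

The hidden fields show up in that body, which confirms from_response() did its part.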
The login using from_response() doesn't work on its own because __EVENTTARGET has an empty value attribute. If you logged in with a real browser, the __EVENTTARGET parameter would be set to b_Login via the JavaScript __doPostBack() function. Since Scrapy cannot execute JavaScript, __EVENTTARGET is sent with an empty value, which causes the login failure.
__EVENTARGUMENT also has an empty value, but __doPostBack() sets it to the empty string anyway, so it doesn't make a difference here.
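So you only need to override the one field that JavaScript would have filled in; __VIEWSTATE and __EVENTVALIDATION can be left to from_response(). A minimal sketch based on the field names in your form (untested against the actual site):

    return FormRequest.from_response(
        response,
        formdata={
            'tb_Username': 'admin',
            'tb_Password': 'password',
            # Only this field needs an explicit value, because the page
            # normally fills it via __doPostBack() in JavaScript.
            '__EVENTTARGET': 'b_Login',
        },
        dont_click=True,
        callback=self.after_login,
    )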
Hope that helps.