
Python unable to retrieve form with urllib or mechanize

I'm trying to fill out and submit a form using Python, but I'm not able to retrieve the resulting page. I've tried both mechanize and urllib/urllib2 methods to post the form, but both run into problems.

The form I'm trying to retrieve is here: http://zrs.leidenuniv.nl/ul/start.php. The page is in Dutch, but this is irrelevant to my problem. It may be noteworthy that the form action redirects to http://zrs.leidenuniv.nl/ul/query.php.

First of all, this is the urllib/urllib2 method I've tried:

import urllib, urllib2
import socket, cookielib

url = 'http://zrs.leidenuniv.nl/ul/start.php'
params = {'day': 1, 'month': 5, 'year': 2012, 'quickselect' : "unchecked",
          'res_instantie': '_ALL_', 'selgebouw': '_ALL_', 'zrssort': "locatie",
          'submit' : "Uitvoeren"}
http_header = {  "User-Agent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.46 Safari/535.11",
                 "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                 "Accept-Language" : "nl-NL,nl;q=0.8,en-US;q=0.6,en;q=0.4" }

timeout = 15
socket.setdefaulttimeout(timeout)

request = urllib2.Request(url, urllib.urlencode(params), http_header)
response = urllib2.urlopen(request)

cookies = cookielib.CookieJar()
cookies.extract_cookies(response, request)
cookie_handler = urllib2.HTTPCookieProcessor(cookies)
redirect_handler = urllib2.HTTPRedirectHandler()

opener = urllib2.build_opener(redirect_handler, cookie_handler)

response = opener.open(request)
html = response.read()

However, when I try to print the retrieved html I get the original page, not the one the form action refers to. So any hints as to why this doesn't submit the form would be greatly appreciated.

Because the above didn't work, I also tried to use mechanize to submit the form. However, this results in a ParseError with the following code:

import mechanize

url = 'http://zrs.leidenuniv.nl/ul/start.php'
br = mechanize.Browser()
response = br.open(url)
br.select_form(nr = 0)

where the last line exits with the following: "ParseError: unexpected '-' char in declaration". Now I realize that this error may indicate an error in the DOCTYPE declaration, but since I can't edit the form page I'm not able to try different declarations. Any help on this error is also greatly appreciated.

Thanks in advance for your help.

asked Nov 12 '22 22:11 by GjjvdBurg


1 Answer

The ParseError occurs because the page's DOCTYPE declaration is malformed.

The page also contains some strange tags, such as:

<!Co Dreef / Eelco de Graaff Faculteit der Rechtsgeleerdheid Universiteit Leiden><!e-mail [email protected] >

Try validating the page yourself...


Nonetheless, you can just strip off the junk to make mechanize's HTML parser happy:

import mechanize

url = 'http://zrs.leidenuniv.nl/ul/start.php'

br = mechanize.Browser()
response = br.open(url)
# Drop the first 177 bytes (the malformed DOCTYPE and junk tags) before parsing
response.set_data(response.get_data()[177:])
br.set_response(response)

br.select_form(nr = 0)
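
The hard-coded offset of 177 will break if the page's header ever changes length. A less brittle variant, sketched below, cuts the response at the first `<html` tag instead. The helper name `strip_leading_junk` is hypothetical, and the sketch assumes that all of the malformed markup sits before the `<html` tag, which I have not verified for this page.

```python
import re


def strip_leading_junk(html):
    """Return html starting at its first <html ...> tag.

    Hypothetical helper: assumes everything before <html is junk
    (malformed DOCTYPE, stray <!...> tags). If no <html tag is found,
    the input is returned unchanged.
    """
    # Use a bytes pattern for bytes input and a text pattern otherwise,
    # so the helper accepts whatever type the response data has.
    pattern = br'<html\b' if isinstance(html, bytes) else r'<html\b'
    match = re.search(pattern, html, re.IGNORECASE)
    if match is None:
        return html
    return html[match.start():]
```

To use it, replace the slicing line with `response.set_data(strip_leading_junk(response.get_data()))` and keep the `br.set_response(response)` call as it is.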

answered Nov 15 '22 11:11 by sloth