I am trying to scrape http://www.nscb.gov.ph/ggi/database.asp, specifically all the tables you get from selecting the municipalities/provinces. I am using Python with lxml.html and mechanize. My scraper works fine so far, but I get HTTP Error 500: Internal Server Error when submitting municipality[19], "Peñarrubia, Abra". I suspect this is due to the character encoding: my guess is that the eñe character (n with a tilde above) causes the problem. How can I fix this?
A working example of this part of my script is shown below. As I am just starting out in Python (and often use snippets I find on SO), any further comments are greatly appreciated.
from BeautifulSoup import BeautifulSoup
import mechanize
import lxml.html
import csv

class PrettifyHandler(mechanize.BaseHandler):
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = mechanize.response_seek_wrapper(response)
        # only use BeautifulSoup if response is html
        if response.info().dict.has_key('content-type') and ('html' in response.info().dict['content-type']):
            soup = BeautifulSoup(response.get_data())
            response.set_data(soup.prettify())
        return response

site = "http://www.nscb.gov.ph/ggi/database.asp"

output_mun = csv.writer(open(r'output-municipalities.csv','wb'))
output_prov = csv.writer(open(r'output-provinces.csv','wb'))

br = mechanize.Browser()
br.add_handler(PrettifyHandler())

# gets municipality stats
response = br.open(site)
br.select_form(name="form2")
muns = br.find_control("strMunicipality2", type="select").items
# municipality #19 is not working, those before do
for pos, item in enumerate(muns[19:]):
    br.select_form(name="form2")
    br["strMunicipality2"] = [item.name]
    print pos, item.name
    response = br.submit(id="button2", type="submit")
    html = response.read()
    root = lxml.html.fromstring(html)
    table = root.xpath('//table')[1]
    data = [
        [td.text_content().strip() for td in row.findall("td")]
        for row in table.findall("tr")
    ]
    print data, "\n"
    for row in data[2:]:
        if row:
            row.append(item.name)
            output_mun.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
    response = br.open(site) #go back button not working

# provinces follow here
Thank you very much!
Edit: to be specific, the error occurs on this line:
response = br.submit(id="button2", type="submit")
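Quick and dirty hack: monkey-patch mechanize's HTMLForm so that every form value is re-encoded from UTF-8 to Latin-1 before it is submitted:

def _pairs(self):
    # same pairs mechanize would submit, with each value converted UTF-8 -> Latin-1
    return [(k, v.decode('utf-8').encode('latin-1')) for (i, k, v, c_i) in self._pairs_and_controls()]

from mechanize import HTMLForm
HTMLForm._pairs = _pairs
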
Or something less invasive that re-encodes only the selected item (I think there is no other way, because the Item class protects its 'name' attribute): put
item.__dict__['name'] = item.name.decode('utf-8').encode('latin-1')
before
br["strMunicipality2"] = [item.name]
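
To see why the re-encoding matters, here is a minimal sketch of how the failing name is form-encoded under each encoding. It assumes, as the fix above implies, that the option names arrive as UTF-8 byte strings and that the server expects Latin-1; neither is confirmed here. The helper name to_latin1 is hypothetical and not part of mechanize. The sketch is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
try:
    from urllib import quote_plus          # Python 2
except ImportError:
    from urllib.parse import quote_plus    # Python 3


def to_latin1(value):
    """Return `value` as Latin-1 bytes (hypothetical helper, not a mechanize API).

    Accepts either a UTF-8 byte string or a unicode string.
    Raises UnicodeEncodeError if a character has no Latin-1 equivalent.
    """
    if isinstance(value, bytes):
        value = value.decode("utf-8")
    return value.encode("latin-1")


name = u"Pe\u00f1arrubia, Abra"

as_utf8 = name.encode("utf-8")      # the n-tilde becomes two bytes: C3 B1
as_latin1 = to_latin1(as_utf8)      # the n-tilde becomes one byte: F1

utf8_form = quote_plus(as_utf8)
latin1_form = quote_plus(as_latin1)

print(utf8_form)    # Pe%C3%B1arrubia%2C+Abra
print(latin1_form)  # Pe%F1arrubia%2C+Abra
```

With such a helper, the per-item fix above could be written as item.__dict__['name'] = to_latin1(item.name), which would also tolerate names that are already unicode strings.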