Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python : extract .csv results after submitting data to a form with mechanize

I'm new to extracting data from the web with Python. Thanks to some other posts and this webpage, I figured out how to submit data to a form, using the module mechanize.

Now, I am stuck with finding how to extract the results. There are lots of different results when submitting the form, but if I could get access to the csv files that would be perfect. I assume you have to use the module re, but then how do you download the results via Python ?

After running the job, the csv files are here: Summary => Results => Download Heavy Chain Table (you can just click "load example" to see how the webpage works).

import re
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this

url = 'http://circe.med.uniroma1.it/proABC/index.php'
response = br.open(url)

br.form = list(br.forms())[1]

# Controls can be found by name
control1 = br.form.find_control("light")

# Text controls can be set as a string
br["light"] = "DIQMTQSPASLSASVGETVTITCRASGNIHNYLAWYQQKQGKSPQLLVYYTTTLADGVPSRFSGSGSGTQYSLKINSLQPEDFGSYYCQHFWSTPRTFGGGTKLEIKRADAAPTVSIFPPSSEQLTSGGASVVCFLNNFYPKDINVKWKIDGSERQNGVLNSWTDQDSKDSTYSMSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNEC" 
br["heavy"] = "QVQLKESGPGLVAPSQSLSITCTVSGFSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSLHTDDTARYYCARERDYRLDYWGQGTTLTVSSASTTPPSVFPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSPRPSETVTCNVAHPASSTKVDKKIVPRDC"

# To submit form
response = br.submit()
content = response.read()
# print content

result = re.findall(r"Prob_Heavy.csv", content)
print result

When printing content, the lines that I'm interested looks like :

<h2>Results</h2><br>
Predictions for Heavy Chain:
<a href='u17003I9f1/Prob_Heavy.csv'>Download Heavy Chain Table</a><br>
Predictions for Light Chain:
<a href='u17003I9f1/Prob_Light.csv'>Download Light Chain Table</a><br>

So the question is : how do I download / get access to href='u17003I9f1/Prob_Heavy.csv' ?

like image 546
Natha Avatar asked Jun 10 '26 13:06

Natha


2 Answers

Here's a quick and dirty example using BeautifulSoup and requests to avoid parsing HTML using regular expressions. sudo pip install bs4 if you have pip but not BeautifulSoup installed already.

import re
import mechanize
from bs4 import BeautifulSoup as bs
import requests
import time


br = mechanize.Browser()
br.set_handle_robots(False)   # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this

url_base = "http://circe.med.uniroma1.it/proABC/"
url_index = url_base + "index.php"

response = br.open(url_index)

br.form = list(br.forms())[1]

# Controls can be found by name
control1 = br.form.find_control("light")

# Text controls can be set as a string
br["light"] = "DIQMTQSPASLSASVGETVTITCRASGNIHNYLAWYQQKQGKSPQLLVYYTTTLADGVPSRFSGSGSGTQYSLKINSLQPEDFGSYYCQHFWSTPRTFGGGTKLEIKRADAAPTVSIFPPSSEQLTSGGASVVCFLNNFYPKDINVKWKIDGSERQNGVLNSWTDQDSKDSTYSMSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNEC" 
br["heavy"] = "QVQLKESGPGLVAPSQSLSITCTVSGFSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSLHTDDTARYYCARERDYRLDYWGQGTTLTVSSASTTPPSVFPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSPRPSETVTCNVAHPASSTKVDKKIVPRDC"

# To submit form
response = br.submit()
content = response.read()
# print content

soup = bs(content)
urls_csv = [x.get("href") for x in soup.findAll("a") if ".csv" in x.get("href")]
for file_path in urls_csv:
    status_code = 404
    retries = 0
    url_csv = url_base + file_path
    file_name = url_csv.split("/")[-1]
    while status_code == 404 and retries < 10:
        print "{} not ready yet".format(file_name)
        req = requests.get(url_csv )
        status_code = req.status_code
        time.sleep(5)
    print "{} ready. Saving.".format(file_name)
    with open(file_name, "wb") as f:
        f.write(req.content)

Running the script in the REPL:

Prob_Heavy.csv not ready yet
Prob_Heavy.csv not ready yet
Prob_Heavy.csv not ready yet
Prob_Heavy.csv ready. Saving.
Prob_Light.csv not ready yet
Prob_Light.csv ready. Saving.
>>> 
>>> 
like image 162
jDo Avatar answered Jun 12 '26 12:06

jDo


In Python2, which it looks like you're using, use urllib2.

>>> import urllib2
>>> URL = "http://circe.med.uniroma1.it/proABC/u17003I9f1/Prob_Heavy.csv"
>>> urllib2.urlopen(URL).read()

Or if you're trying to it dynamically based on the href, you can do:

>>> import urllib2
>>> href='u17003I9f1/Prob_Heavy.csv'
>>> URL = 'http://circe.med.uniroma1.it/proABC/' + href
>>> urllib2.urlopen(URL).read()
like image 37
Christopher Shroba Avatar answered Jun 12 '26 11:06

Christopher Shroba