I'm new to extracting data from the web with Python. Thanks to some other posts and this webpage, I figured out how to submit data to a form using the mechanize module.
Now I am stuck on how to extract the results. Submitting the form produces many different results, but getting access to the CSV files would be perfect. I assume you have to use the re module, but then how do you download the results with Python?
After the job has run, the CSV files are under Summary => Results => Download Heavy Chain Table (you can click "load example" to see how the webpage works).
import re
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
url = 'http://circe.med.uniroma1.it/proABC/index.php'
response = br.open(url)
br.form = list(br.forms())[1]
# Controls can be found by name
control1 = br.form.find_control("light")
# Text controls can be set as a string
br["light"] = "DIQMTQSPASLSASVGETVTITCRASGNIHNYLAWYQQKQGKSPQLLVYYTTTLADGVPSRFSGSGSGTQYSLKINSLQPEDFGSYYCQHFWSTPRTFGGGTKLEIKRADAAPTVSIFPPSSEQLTSGGASVVCFLNNFYPKDINVKWKIDGSERQNGVLNSWTDQDSKDSTYSMSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNEC"
br["heavy"] = "QVQLKESGPGLVAPSQSLSITCTVSGFSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSLHTDDTARYYCARERDYRLDYWGQGTTLTVSSASTTPPSVFPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSPRPSETVTCNVAHPASSTKVDKKIVPRDC"
# To submit form
response = br.submit()
content = response.read()
# print content
result = re.findall(r"Prob_Heavy.csv", content)
print result
When I print content, the lines I'm interested in look like this:
<h2>Results</h2><br>
Predictions for Heavy Chain:
<a href='u17003I9f1/Prob_Heavy.csv'>Download Heavy Chain Table</a><br>
Predictions for Light Chain:
<a href='u17003I9f1/Prob_Light.csv'>Download Light Chain Table</a><br>
So the question is: how do I download or get access to href='u17003I9f1/Prob_Heavy.csv'?
Here's a quick and dirty example that uses BeautifulSoup and requests, so you don't have to parse HTML with regular expressions. If you have pip but not BeautifulSoup, install it with sudo pip install bs4.
import re
import mechanize
from bs4 import BeautifulSoup as bs
import requests
import time
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.set_handle_refresh(False) # can sometimes hang without this
url_base = "http://circe.med.uniroma1.it/proABC/"
url_index = url_base + "index.php"
response = br.open(url_index)
br.form = list(br.forms())[1]
# Controls can be found by name
control1 = br.form.find_control("light")
# Text controls can be set as a string
br["light"] = "DIQMTQSPASLSASVGETVTITCRASGNIHNYLAWYQQKQGKSPQLLVYYTTTLADGVPSRFSGSGSGTQYSLKINSLQPEDFGSYYCQHFWSTPRTFGGGTKLEIKRADAAPTVSIFPPSSEQLTSGGASVVCFLNNFYPKDINVKWKIDGSERQNGVLNSWTDQDSKDSTYSMSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNEC"
br["heavy"] = "QVQLKESGPGLVAPSQSLSITCTVSGFSLTGYGVNWVRQPPGKGLEWLGMIWGDGNTDYNSALKSRLSISKDNSKSQVFLKMNSLHTDDTARYYCARERDYRLDYWGQGTTLTVSSASTTPPSVFPLAPGSAAQTNSMVTLGCLVKGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSPRPSETVTCNVAHPASSTKVDKKIVPRDC"
# To submit form
response = br.submit()
content = response.read()
# print content
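soup = bs(content, "html.parser")  # name the parser explicitly
# Collect the href of every link that points to a CSV file
urls_csv = [a["href"] for a in soup.find_all("a", href=True) if ".csv" in a["href"]]
for file_path in urls_csv:
    status_code = 404
    retries = 0
    url_csv = url_base + file_path
    file_name = url_csv.split("/")[-1]
    # The server needs time to generate the files, so poll until they exist
    while status_code == 404 and retries < 10:
        print "{} not ready yet".format(file_name)
        req = requests.get(url_csv)
        status_code = req.status_code
        retries += 1  # without this the loop never gives up
        time.sleep(5)
    if status_code != 200:
        print "{} could not be fetched (HTTP {})".format(file_name, status_code)
        continue
    print "{} ready. Saving.".format(file_name)
    with open(file_name, "wb") as f:
        f.write(req.content)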
Running the script in the REPL:
Prob_Heavy.csv not ready yet
Prob_Heavy.csv not ready yet
Prob_Heavy.csv not ready yet
Prob_Heavy.csv ready. Saving.
Prob_Light.csv not ready yet
Prob_Light.csv ready. Saving.
>>>
>>>
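The link-extraction step can also be tried offline. The following is only a sketch, and it makes two assumptions that differ from the script above: it targets Python 3, and it swaps BeautifulSoup for the standard library's html.parser so that nothing needs installing. The names CsvLinkParser and find_csv_urls are made up for illustration, and SAMPLE is the results snippet quoted in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Base URL taken from the script above
URL_BASE = "http://circe.med.uniroma1.it/proABC/"


class CsvLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points to a .csv file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")  # None when the anchor has no href
        if href and href.endswith(".csv"):
            self.links.append(href)


def find_csv_urls(html, base=URL_BASE):
    """Return absolute URLs for all CSV links found in html."""
    parser = CsvLinkParser()
    parser.feed(html)
    # urljoin resolves the relative href against the page's base URL
    return [urljoin(base, href) for href in parser.links]


# Results snippet quoted in the question
SAMPLE = """<h2>Results</h2><br>
Predictions for Heavy Chain:
<a href='u17003I9f1/Prob_Heavy.csv'>Download Heavy Chain Table</a><br>
Predictions for Light Chain:
<a href='u17003I9f1/Prob_Light.csv'>Download Light Chain Table</a><br>"""

for url in find_csv_urls(SAMPLE):
    print(url)
```

Using urljoin rather than string concatenation means the result stays correct whether or not the base URL ends with a slash-terminated directory, which is the case here.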
In Python 2, which it looks like you're using, use urllib2.
>>> import urllib2
>>> URL = "http://circe.med.uniroma1.it/proABC/u17003I9f1/Prob_Heavy.csv"
>>> urllib2.urlopen(URL).read()
Or, if you're trying to do it dynamically based on the href, you can do:
>>> import urllib2
>>> href='u17003I9f1/Prob_Heavy.csv'
>>> URL = 'http://circe.med.uniroma1.it/proABC/' + href
>>> urllib2.urlopen(URL).read()
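For readers on Python 3, where urllib2 no longer exists, roughly the same idea can be sketched with urllib.request. This is an assumption-laden sketch rather than a tested recipe for this site: build_url and download are hypothetical helper names, and the job directory u17003I9f1 is the one from the question, which may no longer exist on the server.

```python
import shutil
from urllib.parse import urljoin
from urllib.request import urlopen

# Base URL taken from the question
URL_BASE = "http://circe.med.uniroma1.it/proABC/"


def build_url(href, base=URL_BASE):
    """Resolve a relative href from the results page against the base URL."""
    return urljoin(base, href)


def download(url, dest):
    """Stream the response body at url into the file dest and return dest."""
    with urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Building the URL needs no network access
print(build_url("u17003I9f1/Prob_Heavy.csv"))
```

Calling download(build_url("u17003I9f1/Prob_Heavy.csv"), "Prob_Heavy.csv") would then save the file to the current directory, provided the server still hosts that job.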