I need to download several files via http in Python.
The most obvious way to do it is just using urllib2:
import urllib2 u = urllib2.urlopen('http://server.com/file.html') localFile = open('file.html', 'w') localFile.write(u.read()) localFile.close()
But I'll have to deal with the URLs that are nasty in some way, say like this: http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf
. When downloaded via the browser, the file has a human-readable name, ie. accounts.pdf
.
Is there any way to handle that in python, so I don't need to know the file names and hardcode them into my script?
The first thing to do is to use HTTP/2.0 and keep one conection open for all the files with Keep-Alive. The easiest way to do that is to use the Requests library, and use a session. If this isn't fast enough, then you need to do several parallel downloads with either multiprocessing or threads.
Download scripts like that tend to push a header telling the user-agent what to name the file:
Content-Disposition: attachment; filename="the filename.ext"
If you can grab that header, you can get the proper filename.
There's another thread that has a little bit of code to offer up for Content-Disposition
-grabbing.
remotefile = urllib2.urlopen('http://example.com/somefile.zip') remotefile.info()['Content-Disposition']
Based on comments and @Oli's anwser, I made a solution like this:
from os.path import basename from urlparse import urlsplit def url2name(url): return basename(urlsplit(url)[2]) def download(url, localFileName = None): localName = url2name(url) req = urllib2.Request(url) r = urllib2.urlopen(req) if r.info().has_key('Content-Disposition'): # If the response has Content-Disposition, we take file name from it localName = r.info()['Content-Disposition'].split('filename=')[1] if localName[0] == '"' or localName[0] == "'": localName = localName[1:-1] elif r.url != url: # if we were redirected, the real file name we take from the final URL localName = url2name(r.url) if localFileName: # we can force to save the file as specified name localName = localFileName f = open(localName, 'wb') f.write(r.read()) f.close()
It takes file name from Content-Disposition; if it's not present, uses filename from the URL (if redirection happened, the final URL is taken into account).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With