
How to download a file using python in a 'smarter' way?

I need to download several files via http in Python.

The most obvious way to do it is just using urllib2:

    import urllib2

    u = urllib2.urlopen('http://server.com/file.html')
    localFile = open('file.html', 'w')
    localFile.write(u.read())
    localFile.close()

But I'll have to deal with URLs that are nasty in some way, like this one: http://server.com/!Run.aspx/someoddtext/somemore?id=121&m=pdf. When downloaded via the browser, the file has a human-readable name, e.g. accounts.pdf.

Is there any way to handle that in Python, so I don't need to know the file names and hardcode them into my script?

asked May 14 '09 08:05 by kender





2 Answers

Download scripts like that tend to push a header telling the user-agent what to name the file:

Content-Disposition: attachment; filename="the filename.ext" 

If you can grab that header, you can get the proper filename.

There's another thread that has a little bit of code to offer up for Content-Disposition-grabbing.

    import urllib2

    remotefile = urllib2.urlopen('http://example.com/somefile.zip')
    remotefile.info()['Content-Disposition']
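The raw header value still has to be parsed to get at the name. As a minimal sketch, assuming Python 3 (where `urllib2` became `urllib.request`), the standard library's `email.message.Message` can do the parsing. The helper name `filename_from_content_disposition` is made up for this example, and stripping directory parts from the result is an extra precaution that is not part of the answer above.

```python
from email.message import Message
import posixpath


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() reads the "filename" parameter and strips the quotes.
    name = msg.get_filename()
    if not name:
        return None
    # Assumption: never trust a server-supplied path, keep the last component only.
    name = posixpath.basename(name.replace("\\", "/"))
    return name or None


print(filename_from_content_disposition('attachment; filename="the filename.ext"'))
```

With a live response from `urllib.request.urlopen`, the value to pass in would be `response.headers.get("Content-Disposition")`.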
answered Oct 17 '22 06:10 by Oli



Based on the comments and @Oli's answer, I made a solution like this:

    import urllib2
    from os.path import basename
    from urlparse import urlsplit

    def url2name(url):
        return basename(urlsplit(url)[2])

    def download(url, localFileName = None):
        localName = url2name(url)
        req = urllib2.Request(url)
        r = urllib2.urlopen(req)
        if r.info().has_key('Content-Disposition'):
            # If the response has Content-Disposition, we take file name from it
            localName = r.info()['Content-Disposition'].split('filename=')[1]
            if localName[0] == '"' or localName[0] == "'":
                localName = localName[1:-1]
        elif r.url != url:
            # if we were redirected, the real file name we take from the final URL
            localName = url2name(r.url)
        if localFileName:
            # we can force to save the file as specified name
            localName = localFileName
        f = open(localName, 'wb')
        f.write(r.read())
        f.close()

It takes the file name from the Content-Disposition header; if that header is not present, it uses the last path component of the URL (if a redirect happened, the final URL is used instead). Passing localFileName overrides both.
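For readers on Python 3, where `urllib2` and `urlparse` no longer exist, the same priority order can be sketched roughly as follows. This is an illustration, not a drop-in replacement: `pick_filename` and the `"download"` fallback name are assumptions added here, and the header parsing relies on `response.headers` being an `email.message.Message` subclass, which provides `get_filename()`.

```python
import posixpath
import shutil
from urllib.parse import unquote, urlsplit
from urllib.request import urlopen


def url2name(url):
    """Last path component of a URL, with percent-escapes decoded."""
    return posixpath.basename(unquote(urlsplit(url).path))


def pick_filename(url, final_url, headers, override=None):
    """Choose a local name: override, then Content-Disposition, then URL."""
    if override:
        return override
    name = headers.get_filename()  # parses Content-Disposition if present
    if name:
        # Keep only the last path component of a server-supplied name.
        return posixpath.basename(name.replace("\\", "/"))
    # After a redirect final_url differs from url; otherwise they are equal.
    # "download" is a hypothetical fallback for URLs that end in "/".
    return url2name(final_url or url) or "download"


def download(url, local_file_name=None):
    """Fetch url and save it under the chosen name; returns that name."""
    with urlopen(url) as response:
        name = pick_filename(url, response.url, response.headers, local_file_name)
        with open(name, "wb") as f:
            # Stream to disk in chunks instead of reading the whole body at once.
            shutil.copyfileobj(response, f)
    return name
```

Keeping the name selection in its own function means it can be checked without any network access, for example by passing a hand-built `email.message.Message` as `headers`.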

answered Oct 17 '22 07:10 by kender
