I'm using the Python requests lib to get a PDF file from the web. This works fine, but I now also want the original filename. If I go to a PDF file in Firefox and click download, it already has a filename defined to save the PDF. How do I get this filename?
For example:
    import requests

    r = requests.get('http://www.researchgate.net/profile/M_Gotic/publication/260197848_Mater_Sci_Eng_B47_%281997%29_33/links/0c9605301e48beda0f000000.pdf')
    print(r.headers['content-type'])  # prints 'application/pdf'
I checked r.headers for anything interesting, but there's no filename in there. I was actually hoping for something like r.filename.
Does anybody know how I can get the filename of a downloaded PDF file with the requests library?
The filename is specified in the HTTP header Content-Disposition. To extract the name you would do:
    import re

    d = r.headers['content-disposition']
    fname = re.findall("filename=(.+)", d)[0]
The name is extracted from the header string with a regular expression (the re module).
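One caveat with the regex above: servers often quote the value (filename="report.pdf") or send the encoded form (filename*=UTF-8''...), and "filename=(.+)" then returns the quotes along with the name. A minimal sketch of an alternative, assuming you are happy to let the standard library's email.message module parse the header; the helper name filename_from_content_disposition is made up for this example and is not part of requests:

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename from a Content-Disposition value, or None.

    Hypothetical helper: it reuses the stdlib MIME header parser, which
    strips surrounding quotes and decodes the RFC 2231 'filename*' form.
    """
    msg = Message()
    msg['Content-Disposition'] = header_value
    return msg.get_filename()


# Example header values, not taken from a real response.
print(filename_from_content_disposition('attachment; filename="report.pdf"'))
print(filename_from_content_disposition("attachment; filename*=UTF-8''my%20report.pdf"))
```

With a requests response you would pass in r.headers.get('content-disposition', ''), so a missing header gives None instead of raising KeyError.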
Building on some of the other answers, here's how I do it. If there isn't a Content-Disposition header, I parse it from the download URL:
    import re
    import requests
    from requests.exceptions import RequestException

    url = 'http://www.example.com/downloads/sample.pdf'

    try:
        with requests.get(url) as r:
            fname = ''
            if "Content-Disposition" in r.headers.keys():
                # Prefer the filename suggested by the server.
                fname = re.findall("filename=(.+)", r.headers["Content-Disposition"])[0]
            else:
                # Otherwise fall back to the last segment of the URL.
                fname = url.split("/")[-1]
            print(fname)
    except RequestException as e:
        print(e)
There are arguably better ways of parsing the URL string, but for simplicity I didn't want to involve any more libraries.
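For what it's worth, the standard library already covers this, so no extra dependency is needed. A rough sketch, assuming the last path segment of the URL is the name you want; filename_from_url is a name invented for this example. Unlike url.split("/")[-1], it ignores any query string or fragment and decodes percent-escapes:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, or '' if there is none.

    Hypothetical helper: urlsplit() separates the path from the query
    string and fragment, unquote() decodes %XX escapes, and
    posixpath.basename() keeps only the part after the final slash.
    """
    path = unquote(urlsplit(url).path)
    return posixpath.basename(path)


# Example URLs for illustration only.
print(filename_from_url('http://www.example.com/downloads/sample.pdf?token=abc#page=2'))
print(filename_from_url('http://www.example.com/downloads/my%20file.pdf'))
```

An empty result (for example when the URL ends in a slash) means you still need a default name of your own.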