
How to get pdf filename with Python requests?

I'm using the Python requests library to get a PDF file from the web. This works fine, but I now also want the original filename. If I open a PDF file in Firefox and click download, the save dialog already suggests a filename for the PDF. How do I get this filename?

For example:

    import requests

    r = requests.get('http://www.researchgate.net/profile/M_Gotic/publication/260197848_Mater_Sci_Eng_B47_%281997%29_33/links/0c9605301e48beda0f000000.pdf')
    print(r.headers['content-type'])  # prints 'application/pdf'

I checked r.headers for anything interesting, but there's no filename in there. I was actually hoping for something like r.filename.

Does anybody know how I can get the filename of a downloaded PDF file with the requests library?

kramer65 asked Aug 04 '15 08:08



2 Answers

When the server provides a filename, it is specified in the HTTP Content-Disposition header. To extract the name you would do:

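    import re

    # r is the response returned by requests.get() in the question.
    # This raises KeyError if the server sent no Content-Disposition header.
    d = r.headers['content-disposition']
    # The match keeps any surrounding quotes and trailing parameters.
    fname = re.findall("filename=(.+)", d)[0]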

The name is extracted from the header value with a regular expression (the re module).
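Because a Content-Disposition value can quote the filename or carry extra parameters after it, one possible alternative to the regular expression is to let the standard library's email.message.Message parse the header. This is only a sketch: the helper name filename_from_content_disposition is made up for illustration, and it assumes the header value has already been read from r.headers.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None.

    Hypothetical helper, not part of requests.
    """
    if not header_value:
        return None
    msg = Message()
    msg["Content-Disposition"] = header_value
    # get_filename() strips surrounding quotes and ignores other parameters.
    return msg.get_filename()


if __name__ == "__main__":
    print(filename_from_content_disposition('attachment; filename="report.pdf"'))
    print(filename_from_content_disposition("inline"))
```

With a response r, this could be called as filename_from_content_disposition(r.headers.get('content-disposition')), which also avoids a KeyError when the header is missing.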

Eugene V answered Oct 05 '22 12:10


Building on some of the other answers, here's how I do it. If there isn't a Content-Disposition header, I take the filename from the download URL:

    import re
    import requests
    from requests.exceptions import RequestException

    url = 'http://www.example.com/downloads/sample.pdf'

    try:
        with requests.get(url) as r:
            fname = ''
            if "Content-Disposition" in r.headers:
                # Prefer the name the server suggests, if it gives one.
                matches = re.findall("filename=(.+)", r.headers["Content-Disposition"])
                if matches:
                    fname = matches[0]
            if not fname:
                # Fall back to the last segment of the URL.
                fname = url.split("/")[-1]

            print(fname)
    except RequestException as e:
        print(e)

There are arguably better ways of parsing the URL string, but for simplicity I didn't want to involve any more libraries.
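As a sketch of one such alternative, the standard library's urllib.parse (so no third-party dependency) can separate the path from any query string or fragment before taking the last segment. The helper name filename_from_url is hypothetical, and the URLs are made-up examples.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, percent-decoded.

    Hypothetical helper; returns '' when the path ends in '/'.
    """
    path = urlsplit(url).path  # drops the query string and fragment
    return unquote(posixpath.basename(path))


if __name__ == "__main__":
    print(filename_from_url("http://www.example.com/downloads/sample.pdf?token=abc"))
    print(filename_from_url("http://www.example.com/files/my%20report.pdf"))
```

Unlike url.split("/")[-1], this does not include a trailing ?token=... in the result, and it decodes escapes such as %20.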

Nilpo answered Oct 05 '22 13:10