I am requesting URLs using the requests package in Python (e.g. file = requests.get(url)). The URLs do not include a file extension, and sometimes an HTML file is returned and sometimes a PDF is returned.
Is there a way of determining whether the returned file is a PDF or HTML, or more generally, what the file format is? The browser is able to tell, so I assume it must be indicated in the response.
This will be found in the Content-Type header of the response, which will be either text/html or application/pdf.
import requests

r = requests.get('http://example.com/file')
# Default to '' so a missing header does not raise a TypeError below
content_type = r.headers.get('content-type', '')

if 'application/pdf' in content_type:
    ext = '.pdf'
elif 'text/html' in content_type:
    ext = '.html'
else:
    ext = ''
    print('Unknown type: {}'.format(content_type))

# r.content holds the full body; r.raw is only readable when stream=True is passed
with open('myfile' + ext, 'wb') as f:
    f.write(r.content)
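For the more general case, where the header may be missing, carry parameters such as a charset, or be set to something generic, a slightly more defensive check might look like the sketch below. The helper name guess_extension is hypothetical (it is not part of requests). It assumes you pass it the Content-Type header value and, optionally, the response body, and it falls back to the '%PDF-' signature that PDF files start with.

```python
import mimetypes


def guess_extension(content_type, body=b''):
    """Return a file extension such as '.pdf', or '' if the type is unknown.

    Hypothetical helper: content_type is the raw Content-Type header value
    (may be None), body is the response body as bytes.
    """
    # Drop parameters such as '; charset=utf-8' and normalise the case
    mime = (content_type or '').split(';')[0].strip().lower()
    if mime == 'application/pdf':
        return '.pdf'
    if mime == 'text/html':
        return '.html'
    # The header was not conclusive: PDF files begin with the bytes '%PDF-'
    if body.startswith(b'%PDF-'):
        return '.pdf'
    # Last resort: look the MIME type up in the standard library's table
    return mimetypes.guess_extension(mime) or ''
```

With a response r from requests.get(url), this could be called as guess_extension(r.headers.get('content-type'), r.content). The header is trusted first because it is what the server declares; the byte check only matters when the server sends a generic or missing type.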