
Determine if url is a pdf or html file

I am requesting URLs using the requests package in Python (e.g. file = requests.get(url)). The URLs do not include a file extension, and sometimes an HTML file is returned and sometimes a PDF.

Is there a way of determining whether the returned file is a PDF or HTML, or more generally, what the file format is? The browser is able to tell, so I assume it must be indicated in the response.

asked Aug 01 '16 03:08 by kyrenia


1 Answer

The format is given in the response's Content-Type header: text/html for an HTML page or application/pdf for a PDF, possibly followed by parameters such as a charset.

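 import requests

 r = requests.get('http://example.com/file')
 # Default to '' so the membership tests below work if the header is missing.
 content_type = r.headers.get('content-type', '').lower()

 # Substring match tolerates parameters such as '; charset=utf-8'.
 if 'application/pdf' in content_type:
     ext = '.pdf'
 elif 'text/html' in content_type:
     ext = '.html'
 else:
     ext = ''
     print('Unknown type: {}'.format(content_type))

 # r.content holds the whole body; without stream=True, r.raw is already drained.
 with open('myfile'+ext, 'wb') as f:
     f.write(r.content)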
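
Servers sometimes omit the header or send a generic type such as application/octet-stream, so it can help to fall back to the first bytes of the body. The following is a minimal sketch of that idea, not part of the requests API: the helper name guess_extension and the 1024-byte window are assumptions made for illustration. It relies on PDF files starting with the marker %PDF- and on HTML documents usually starting with a doctype or an <html> tag.

```python
def guess_extension(content_type, body=b''):
    """Return '.pdf', '.html', or '' from a Content-Type value and body bytes.

    Hypothetical helper: the name and the heuristics are illustrative only.
    """
    # Drop parameters such as '; charset=utf-8' and normalise case.
    media_type = (content_type or '').split(';', 1)[0].strip().lower()

    if media_type == 'application/pdf':
        return '.pdf'
    if media_type in ('text/html', 'application/xhtml+xml'):
        return '.html'

    # Header missing or generic: inspect the start of the body instead.
    head = body[:1024].lstrip()
    if head.startswith(b'%PDF-'):
        return '.pdf'
    lowered = head.lower()
    if lowered.startswith(b'<!doctype html') or lowered.startswith(b'<html'):
        return '.html'
    return ''
```

With a response r from requests.get(url), this could be called as guess_extension(r.headers.get('content-type'), r.content). The header is checked first because it is the server's explicit statement; the byte check is only a heuristic for when that statement is absent or unhelpful.
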

answered Sep 26 '22 19:09 by Wayne Werner