I am trying to download a PDF file from a website and save it to disk. My attempts either fail with encoding errors or result in blank PDFs.
```python
In [1]: import requests

In [2]: url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'

In [3]: response = requests.get(url)

In [4]: with open('/tmp/metadata.pdf', 'wb') as f:
   ...:     f.write(response.text)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-4-4be915a4f032> in <module>()
      1 with open('/tmp/metadata.pdf', 'wb') as f:
----> 2     f.write(response.text)
      3
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)

In [5]: import codecs

In [6]: with codecs.open('/tmp/metadata.pdf', 'wb', encoding='utf8') as f:
   ...:     f.write(response.text)
   ...:
```
I know it is a codec problem of some kind but I can't seem to get it to work.
There are a couple of Python libraries you can use to extract data from PDFs. For example, the PyPDF2 library can extract text from PDFs where the text is laid out sequentially or in a formatted manner, i.e. in lines or forms. You can also extract tables from PDFs with the Camelot library.
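As a minimal sketch of the PyPDF2 route (assuming a recent PyPDF2 release with the `PdfReader` API, and that the PDF has already been saved to `/tmp/metadata.pdf`):

```python
from PyPDF2 import PdfReader

# Open the previously downloaded PDF and dump the text of each page.
reader = PdfReader('/tmp/metadata.pdf')
for page in reader.pages:
    # extract_text() returns the page's textual content as a string.
    print(page.extract_text())
```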
You should use `response.content` in this case:

```python
with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)
```
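If you also want to sanity-check the download (the question mentions blank PDFs), a quick check is to look at the status code and the first few bytes of the body; every valid PDF starts with the `%PDF` magic bytes:

```python
# response.content is bytes, so we can inspect the file signature directly.
print(response.status_code)      # should be 200
print(response.content[:5])      # a real PDF starts with b'%PDF-'
```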
From the documentation:

You can also access the response body as bytes, for non-text requests:

```python
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...
```
So that means:

`response.text` returns the response body as a string object; use it when you're downloading a text file, such as an HTML page.

`response.content` returns the response body as a bytes object; use it when you're downloading a binary file, such as a PDF, an audio file, or an image.
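A quick way to see the difference with the URL from the question (the `Content-Type` check is just an illustration, not something requests requires):

```python
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
response = requests.get(url)

# The server reports the body type in the Content-Type header,
# e.g. 'application/pdf' for this URL.
print(response.headers.get('Content-Type'))

print(type(response.text))     # <class 'str'>   - decoded text, wrong for binary data
print(type(response.content))  # <class 'bytes'> - raw bytes, safe to write with 'wb'
```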
You can also use `response.raw` instead; use it when the file you're about to download is large. Below is a basic streaming example, which you can also find in the documentation:

```python
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
chunk_size = 2000

r = requests.get(url, stream=True)

with open('/tmp/metadata.pdf', 'wb') as fd:
    # Read the body chunk_size bytes at a time instead of all at once.
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
```
`chunk_size` is the chunk size you want to use. If you set it to 2000, requests will download the file 2000 bytes at a time: it reads the first 2000 bytes, writes them to the file, and repeats until the download is finished.
This can save RAM. But I'd prefer `response.content` in this case, since your file is small. As you can see, using `response.raw` is more complex.
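For completeness, here is a minimal sketch of what using `response.raw` directly would look like, copying the raw stream to disk with `shutil.copyfileobj`; note that `stream=True` is still required so the body isn't loaded into memory up front, and that `r.raw` hands back the bytes as they came off the wire (no content decoding):

```python
import shutil
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'

r = requests.get(url, stream=True)

with open('/tmp/metadata.pdf', 'wb') as fd:
    # r.raw is the underlying urllib3 response object; copyfileobj
    # streams it to the file without buffering the whole body.
    shutil.copyfileobj(r.raw, fd)
```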
Related:
How to download large file in python with requests.py?
How to download image using requests