I am trying to retrieve a file using urlretrieve while adding a custom header.
While checking the source code of urllib.request, I realized that urlopen can take a Request object as a parameter instead of just a string, which lets me set the header I want.
But if I try to do the same with urlretrieve, I get a TypeError: expected string or bytes-like object, as mentioned in this other post.
What I ended up doing is rewriting my own urlretrieve, removing the line that throws the error (that line is irrelevant in my use case).
It works fine, but I am wondering if there is a better/cleaner way of doing it than rewriting my own urlretrieve. If it is possible to pass a custom header to urlopen, it feels like it should be possible to do the same with urlretrieve?
The Python 3 standard library has a new urllib, which is a merged/refactored/rewritten version of Python 2's urllib and urllib2 packages. urllib3 is a third-party package. Despite the name, it is unrelated to the standard library packages, and there is no intention to include it in the standard library in the future.
The urllib module in Python 3 lets you access websites from your program. This opens up as many doors for your programs as the internet opens up for you. urllib in Python 3 is slightly different from urllib2 in Python 2, but they are mostly the same.
I found a way where you only have to add a few extra lines of code...
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve("type URL here", "path/file_name")
Should you wish to learn about the details you can refer to the python documentation: https://docs.python.org/3/library/urllib.request.html
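To check that the header really reaches the server, here is a rough sketch that runs a throwaway local HTTP server and downloads from it with urlretrieve. The names EchoHandler, seen_user_agents and target are made up for this test, and the empty ProxyHandler is only there (my assumption) so that a system proxy does not intercept the request to 127.0.0.1.

```python
import http.server
import os
import tempfile
import threading
import urllib.request

seen_user_agents = []  # filled in by the test server below


class EchoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical test handler: records the User-Agent of each GET."""

    def do_GET(self):
        seen_user_agents.append(self.headers.get("User-Agent"))
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the console quiet


# Port 0 lets the OS pick a free port.
server = http.server.HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# Same idea as above; ProxyHandler({}) disables proxies for this local test.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
opener.addheaders = [("User-agent", "Mozilla/5.0")]
urllib.request.install_opener(opener)

target = os.path.join(tempfile.mkdtemp(), "out.bin")
urllib.request.urlretrieve(url, target)

server.shutdown()
server.server_close()
# Put a default opener back so later code is not affected.
urllib.request.install_opener(urllib.request.build_opener())

print(seen_user_agents)
```

The printed list should contain the User-Agent value that was set through addheaders, and the downloaded file should hold the body served by the test handler.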
urllib.request.urlretrieve() uses urllib.request.urlopen() internally (at least in Python 3), so you can influence its behavior the same way you can influence urlopen.
When urlopen(params) is invoked, it first looks at the special global variable urllib.request._opener. If it is None, urlopen sets the variable to a default opener; otherwise it keeps it as it is. In the next step it calls urllib.request._opener.open(<urlopen_params>) (in the next sections I will refer to urllib.request._opener simply as opener).
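A small sketch of that global in action. Note that _opener is a private name, so treat this as an illustration of the CPython implementation rather than something to rely on; installed_is_ours is just a name I picked.

```python
import urllib.request

opener = urllib.request.build_opener()
urllib.request.install_opener(opener)

# install_opener() stores the opener in the private module global _opener,
# which is what urlopen() (and therefore urlretrieve()) will use.
installed_is_ours = urllib.request._opener is opener
print(installed_is_ours)

# Put a fresh default opener back.
urllib.request.install_opener(urllib.request.build_opener())
```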
The opener holds a list of handlers for different protocols. When opener.open() is called, it performs these actions:
1. It creates a urllib.request.Request object (or, if you pass a Request directly, it just uses it).
2. From the Request object it extracts the protocol (deduced from the URL scheme).
3. For that protocol it looks up and calls these handler methods:
   - protocol_request (e.g. http_request): pre-processes the request before the connection is opened.
   - protocol_open: actually creates the connection with the remote server.
   - protocol_response: processes the response from the server.
For your own opener you have to do these 3 steps:
1. Create your own handler.
2. Create an opener that contains it (function urllib.request.build_opener).
3. Install the opener into the global variable urllib.request._opener (function urllib.request.install_opener).
urllib.request.build_opener creates an opener that contains your custom handler and adds the default handlers, except those that your custom handler inherits from.
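You can see that replacement by looking at what build_opener assembled. This is only a sketch: TaggedHTTP is a hypothetical do-nothing subclass, and the handlers attribute of the opener is not documented as far as I know, so I am assuming it keeps holding the list of handler instances.

```python
import urllib.request


class TaggedHTTP(urllib.request.HTTPHandler):
    """Hypothetical subclass, used only to see what build_opener() does."""


custom = urllib.request.build_opener(TaggedHTTP())

# Exact types of the handlers the opener ended up with.
handler_types = [type(h) for h in custom.handlers]

has_custom = TaggedHTTP in handler_types
has_stock_http = urllib.request.HTTPHandler in handler_types
print(has_custom, has_stock_http)
```

The subclass should be present while the stock HTTPHandler is left out. HTTPSHandler is a separate class, so a handler derived from HTTPHandler only affects http:// URLs; for https:// you would subclass HTTPSHandler and override https_request in the same way.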
So to add a custom header you can write something like this:
import urllib.request as req

class MyHTTP(req.HTTPHandler):
    def http_request(self, request):
        # Add the custom header before the connection is opened.
        request.headers["MyHeader"] = "Content of my header"
        return super().http_request(request)

opener = req.build_opener(MyHTTP())
req.install_opener(opener)
From this point on, when you call urllib.request.urlretrieve() or anything else that uses urlopen(), it will use your handler for HTTP communication. When you want to get back to the default handlers, you can just call:
import urllib.request as req
req.install_opener(req.build_opener())
To be honest, I don't know if it is a better/cleaner solution than yours, but it uses the mechanisms already provided by urllib.
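If you would rather not touch the global opener at all, another option is to skip urlretrieve and use what the question already noticed: urlopen accepts a Request with headers. A minimal sketch, where retrieve_with_headers is a hypothetical helper name of mine and not something from urllib:

```python
import shutil
import urllib.request


def retrieve_with_headers(url, filename, headers):
    """Hypothetical helper: save the resource at url to filename,
    sending the given extra headers with the request."""
    request = urllib.request.Request(url, headers=headers)
    # Stream the response body into the file instead of reading it all at once.
    with urllib.request.urlopen(request) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)
    return filename
```

You would call it as retrieve_with_headers("type URL here", "path/file_name", {"User-Agent": "Mozilla/5.0"}). Unlike urlretrieve, it has no reporthook and does not return the response headers, so it only fits when you just need the file.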