urllib.urlretrieve with custom header

Tags:

python-3.x

urllib

urlretrieve

I am trying to retrieve a file using urlretrieve, while adding a custom header.

While checking the source code of urllib.request I realized that urlopen can take a Request object as a parameter instead of just a string, which lets me set the header I want. But if I try to do the same with urlretrieve, I get a TypeError: expected string or bytes-like object, as mentioned in this other post.

What I ended up doing is rewriting my own urlretrieve, removing the line throwing the error (that line is irrelevant in my use case).

It works fine but I am wondering if there is a better/cleaner way of doing it, rather than rewriting my own urlretrieve. If it is possible to pass a custom header to urlopen, it feels like it should be possible to do the same with urlretrieve?

asked Jul 21 '17 23:07

realUser404

2 Answers

I found a way where you only have to add a few extra lines of code...

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
urllib.request.urlretrieve("type URL here", "path/file_name")

If you want to learn about the details, you can refer to the Python documentation: https://docs.python.org/3/library/urllib.request.html
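
One thing to be aware of: assigning to addheaders replaces the opener's default User-agent entry. If you would rather keep that entry and add your own header next to it, appending is one option. This is only a minimal sketch; the header name and value are placeholders, not something urllib requires.

```python
import urllib.request

opener = urllib.request.build_opener()

# A fresh opener starts with a single default entry:
# ('User-agent', 'Python-urllib/<version>').
default_headers = list(opener.addheaders)

# Appending keeps the default entry and adds the custom header (placeholder
# name and value) to every request made through this opener.
opener.addheaders.append(("MyHeader", "Content of my header"))

# After this call, urlopen() and urlretrieve() both use the opener.
urllib.request.install_opener(opener)
```

The entries in addheaders are sent with every request that goes through the installed opener, so this suits headers you want applied globally.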

answered Sep 20 '22 01:09

Lost Crotchet

urllib.request.urlretrieve() calls urllib.request.urlopen() internally (at least in Python 3), so anything you do to influence the behavior of urlopen also affects urlretrieve.

When urlopen(params) is invoked, it first looks at the module-level global variable urllib.request._opener. If it is None, urlopen sets it to an opener built with the default set of handlers; otherwise it leaves it as it is. In the next step it calls urllib.request._opener.open(<urlopen_params>) (from here on I will refer to urllib.request._opener simply as opener).

The opener holds a list of handlers for different protocols. When opener.open() is called, it performs these actions:

1. It creates a urllib.request.Request object from the URL (or, if you pass a Request directly, it just uses that).
2. It extracts the protocol from the Request object (deduced from the URL scheme).
3. Based on the protocol, it looks up and calls these handler methods:
- protocol_request (e.g. http_request): pre-processes the request before the connection is opened.
- protocol_open: actually creates the connection to the remote server.
- protocol_response: processes the response from the server.
- For other methods, see the Python documentation.
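
To see the order of these hooks for yourself, here is a minimal sketch. TracingHandler and the example.invalid URL are names I made up for illustration, and http_open returns a canned response instead of opening a connection so that the snippet runs offline. Setting msg on the response is needed only because the default HTTPErrorProcessor reads that attribute.

```python
import email.message
import io
import urllib.request
import urllib.response


class TracingHandler(urllib.request.BaseHandler):
    """Hypothetical handler that records which hooks the opener calls."""

    # Lower than the default of 500, so this handler is asked before the
    # built-in HTTPHandler.
    handler_order = 50

    def __init__(self):
        self.calls = []

    def http_request(self, request):
        # Pre-processing hook: must return the (possibly modified) request.
        self.calls.append("http_request")
        return request

    def http_open(self, request):
        # Normally the connection is made here. Returning a canned response
        # keeps the sketch offline and stops later handlers from being tried.
        self.calls.append("http_open")
        response = urllib.response.addinfourl(
            io.BytesIO(b"hello"), email.message.Message(), request.full_url, 200
        )
        response.msg = "OK"  # read by the default HTTPErrorProcessor
        return response

    def http_response(self, request, response):
        # Post-processing hook: must return the (possibly replaced) response.
        self.calls.append("http_response")
        return response


tracer = TracingHandler()
opener = urllib.request.build_opener(tracer)
with opener.open("http://example.invalid/") as response:
    body = response.read()

print(tracer.calls)  # ['http_request', 'http_open', 'http_response']
```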

To use your own opener, you need three steps:

1. Create your own handler.
2. Build an opener whose handler list contains your custom handler (function urllib.request.build_opener).
3. Install the new opener into urllib.request._opener (function urllib.request.install_opener).

urllib.request.build_opener creates an opener that contains your custom handler plus the default handlers, except for any default handler that your custom handler inherits from (your handler takes its place).
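
A quick way to check that replacement rule is to look at what each opener contains. This is only a sketch: it relies on the opener's handlers list, which I am treating as an implementation detail of CPython rather than a documented API, and MyHTTP here is an empty placeholder subclass.

```python
import urllib.request


class MyHTTP(urllib.request.HTTPHandler):
    """Placeholder subclass, used only to show the replacement rule."""


default_opener = urllib.request.build_opener()
custom_opener = urllib.request.build_opener(MyHTTP())


def handler_types(opener):
    # OpenerDirector keeps its handler instances in the `handlers` list.
    return [type(handler) for handler in opener.handlers]


# The default opener contains the stock HTTPHandler.
print(urllib.request.HTTPHandler in handler_types(default_opener))  # True
# The custom opener drops it because MyHTTP inherits from it...
print(urllib.request.HTTPHandler in handler_types(custom_opener))   # False
# ...and uses MyHTTP in its place.
print(MyHTTP in handler_types(custom_opener))                       # True
```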

So, to add a custom header, you can write something like this:

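import urllib.request as req

class MyHTTP(req.HTTPHandler):
    # Called for every http:// request before the connection is opened.
    # https:// requests go through HTTPSHandler.https_request instead.
    def http_request(self, request):
        request.headers["MyHeader"] = "Content of my header"
        return super().http_request(request)

# build_opener() uses MyHTTP in place of the default HTTPHandler.
opener = req.build_opener(MyHTTP())
req.install_opener(opener)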

From this point on, any call to urllib.request.urlretrieve(), or to anything else that uses urlopen(), will go through your handler for HTTP communication. When you want to go back to the default handlers, you can just call:

import urllib.request as req

# An opener built with no arguments contains only the default handlers.
req.install_opener(req.build_opener())
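
If you only need the custom opener for a few calls, one option is to wrap the install and restore steps in a context manager. This is a sketch, not something urllib provides: temporary_opener is a hypothetical helper name, and it reads the private urllib.request._opener attribute, so it depends on CPython implementation details.

```python
import contextlib
import urllib.request


@contextlib.contextmanager
def temporary_opener(opener):
    """Install `opener` globally, then put the previous one back."""
    # _opener is a private module attribute (None until an opener is
    # installed), so this relies on implementation details.
    previous = urllib.request._opener
    urllib.request.install_opener(opener)
    try:
        yield opener
    finally:
        urllib.request.install_opener(previous)


custom = urllib.request.build_opener()

before = urllib.request._opener
with temporary_opener(custom):
    # urllib.request.urlretrieve(url, filename) would use `custom` here.
    installed_inside = urllib.request._opener is custom
restored_after = urllib.request._opener is before
```

The opener is still global while the block runs, so other threads calling urlopen at that moment would pick it up too.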

To be honest, I don't know whether this is a better or cleaner solution than yours, but it uses the mechanisms that urllib already provides.
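
If installing a global opener feels too heavy, another possibility is a small helper built on urlopen, which does accept a Request object. The following is a minimal sketch under that assumption; retrieve_with_headers is a hypothetical name, and unlike urlretrieve it has no reporthook and does not return the response headers. The usage part fetches a data: URL only so that the example runs without a network.

```python
import os
import shutil
import tempfile
import urllib.request


def retrieve_with_headers(url, filename, headers):
    """Download `url` to `filename`, sending the extra `headers` (a dict)."""
    request = urllib.request.Request(url, headers=headers)
    # urlopen accepts a Request object, unlike urlretrieve.
    with urllib.request.urlopen(request) as response, open(filename, "wb") as out_file:
        # Stream the body to disk in chunks instead of reading it all at once.
        shutil.copyfileobj(response, out_file)
    return filename


# Offline usage check: urlopen handles data: URLs without any connection.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "out.txt")
    retrieve_with_headers(
        "data:text/plain,hello", target, {"MyHeader": "Content of my header"}
    )
    with open(target, "rb") as fh:
        content = fh.read()
```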

answered Sep 21 '22 01:09

Qeek
