Return last URL in sequence of redirects

Tags: python, python-requests

I sometimes need to fetch pages with Requests and parse them with Beautiful Soup, starting from URLs that are provided in forms such as:

http://bit.ly/sdflksdfwefwe

http://stup.id/sdfslkjsfsd

http://0.r.msn.com/sdflksdflsdj

Of course, these URLs generally 'resolve' to a canonical URL such as http://real-website.com/page.html. How can I get the last URL in the resolution / redirect chain?

My code generally looks like this:

from bs4 import BeautifulSoup
import requests

response = requests.get(url)
soup = BeautifulSoup(response.text, from_encoding=response.encoding)
canonical_url = response.??? ## This is what I need to know

Note that I don't mean querying http://bit.ly/bllsht separately just to see where it goes. Rather, since I am already fetching the page it returns and parsing it with Beautiful Soup, I also want to get the canonical URL that was the last in the redirect chain.

Thanks.

asked Jun 12 '13 09:06

dotancohen

1 Answer

It's in the url attribute of your response object.

>>> response = requests.get('http://bit.ly/bllsht')
>>> response.url
u'http://www.thenews.org/sports/well-hey-there-murray-state-1-21-11-1.2436937'

You could easily find this information in the “Quick Start” page.
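If you also want the intermediate hops, the response's history attribute holds the redirect responses that led to the final one, oldest first. Below is a minimal sketch of collecting the whole chain. The helper name redirect_chain and the stand-in fake_response object are assumptions made up for illustration (the stand-in only exists so the snippet runs without network access); they are not part of Requests.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return every URL visited, in order, ending with the final one.

    `response` is expected to look like a requests.Response: `.history`
    holds the intermediate redirect responses (oldest first) and `.url`
    is the URL the final response was served from.
    """
    return [hop.url for hop in response.history] + [response.url]


# Hypothetical stand-in for a real response, so this sketch runs offline.
# With Requests you would pass the result of requests.get(url) instead.
fake_response = SimpleNamespace(
    url="http://real-website.com/page.html",
    history=[SimpleNamespace(url="http://bit.ly/sdflksdfwefwe")],
)

chain = redirect_chain(fake_response)
canonical_url = chain[-1]  # same value as fake_response.url
print(canonical_url)
```

With a real request, redirect_chain(requests.get(url)) should give the same kind of list, because requests.get follows redirects by default. The last element is just response.url, so the same response you hand to Beautiful Soup already carries the final URL and no second request is needed.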

answered Sep 28 '22 06:09

kirelagin
