I sometimes need to use Beautiful Soup and Requests to parse pages whose URLs are given in shortened or redirecting form, such as:
http://bit.ly/sdflksdfwefwe
http://stup.id/sdfslkjsfsd
http://0.r.msn.com/sdflksdflsdj
Of course, these URLs generally 'resolve' to a canonical URL such as http://real-website.com/page.html. How can I get the last URL in the resolution / redirect chain?
My code generally looks like this:
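from bs4 import BeautifulSoup
import requests
response = requests.get(url)
# Pass the raw bytes so that from_encoding is honoured; it is ignored for already-decoded text.
soup = BeautifulSoup(response.content, "html.parser", from_encoding=response.encoding)
canonical_url = response.???  # This is what I need to know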
Note that I don't want to make a separate query to http://bit.ly/bllsht just to see where it goes. Since I am already fetching the page it returns and parsing it with Beautiful Soup, I want to get, from that same response, the canonical URL that was last in the redirect chain.
Thanks.
Type "cache:sitename.com" in the address bar of Chrome and press "Enter" where "sitename" is the URL that is generating the redirect. This will show you a cached version of the site on which you can use the Inspect Element pane to find and capture the redirect URL.
URL Redirect (also referred to as URL Forwarding) is a technique which is used to redirect your domain's visitors to a different URL. You can forward your domain name to any website, webpage, etc.
To follow redirect with Curl, use the -L or --location command-line option. This flag tells Curl to resend the request to the new address. When you send a POST request, and the server responds with one of the codes 301, 302, or 303, Curl will make the subsequent request using the GET method.
It's in the url attribute of your response object.
>>> response = requests.get('http://bit.ly/bllsht')
>>> response.url
u'http://www.thenews.org/sports/well-hey-there-murray-state-1-21-11-1.2436937'
You could easily find this information in the “Quick Start” page.
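If you also want the intermediate hops, Requests keeps the redirect responses in response.history (oldest first). The sketch below is only an illustration: redirect_chain is a hypothetical helper name, not part of Requests, and it assumes it is handed the object returned by requests.get.

```python
def redirect_chain(response):
    """Return every URL visited, in order, ending with the final one.

    `response` is assumed to be a requests.Response: `history` holds the
    intermediate redirect responses (oldest first) and `url` is the address
    of the page whose body was actually returned.
    """
    # URLs of the redirecting responses (e.g. the short link itself).
    hops = [hop.url for hop in response.history]
    # The final URL, i.e. the one the parsed HTML came from.
    hops.append(response.url)
    return hops
```

With the code from the question, redirect_chain(response)[-1] is the same value as response.url, and the earlier entries are the short links that were followed to get there.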