My code is as follows:
import urllib.request

url_orig = 'http://www.has-sante.fr/portail/jcms/c_676945/fr/prialt-ct-5245'
u = urllib.request.urlopen(url_orig)
print(u.geturl())
Basically, the URL gets redirected twice. The output should be:
http://www.has-sante.fr/portail/upload/docs/application/pdf/2008-07/ct-5245_prialt_.pdf
But the output that I'm getting is the first redirect:
http://www.has-sante.fr/portail/plugins/ModuleXitiKLEE/types/FileDocument/doXiti.jsp?id=c_676945
How do I get the required final URL? Any help would be appreciated!
This might be overkill for what you want, but it is an alternative to using regular expressions. This answer uses the Selenium browser automation Python API to follow the redirects. It will also open the PDF file in a browser window. The code below assumes you are using Firefox, but you can use another browser by swapping in the corresponding driver, e.g. webdriver.Chrome() or webdriver.Ie().
To install selenium: pip install selenium
The code:
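from selenium import webdriver

# Start a real Firefox session; the browser follows every redirect itself
driver = webdriver.Firefox()
link = 'http://www.has-sante.fr/portail/jcms/c_676945/fr/prialt-ct-5245'
driver.get(link)
# URL the browser ended up on after the redirects
print(driver.current_url)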
It is also possible to run the browser in the background (headless mode) so that no window pops up. An added benefit of this solution is that if the site changes the way the redirection works, you will not need to update any regular expressions in your code.
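For comparison, the regular-expression approach mentioned above might look roughly like the sketch below. It assumes the intermediate page redirects through an HTML meta refresh tag; the question does not show how the site actually performs the second redirect, so the pattern, the helper name find_meta_refresh_target and the sample HTML are all hypothetical and would need adjusting to the real page.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: matches <meta http-equiv="refresh" content="0; url=...">
# and captures the redirect target. The real page may use a different mechanism.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)


def find_meta_refresh_target(html, base_url):
    """Return the absolute URL of a meta-refresh redirect, or None if absent."""
    match = META_REFRESH.search(html)
    if match is None:
        return None
    # Resolve relative targets against the URL of the page that contained them
    return urljoin(base_url, match.group(1))


# Made-up HTML standing in for the body of the intermediate page
sample_html = (
    '<html><head>'
    '<meta http-equiv="refresh" content="0; url=/portail/upload/docs/example.pdf">'
    '</head></html>'
)
sample_base = (
    'http://www.has-sante.fr/portail/plugins/ModuleXitiKLEE/'
    'types/FileDocument/doXiti.jsp?id=c_676945'
)

print(find_meta_refresh_target(sample_html, sample_base))
```

In real use, the HTML would come from reading the response returned by urllib.request.urlopen() rather than from a hard-coded string. This is the fragility the answer refers to: if the site switches to a JavaScript redirect or changes its markup, the pattern stops matching, whereas the Selenium version keeps working.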