I am working on an app which needs to parse URLs (mostly HTTP URLs) in HTML pages - I have no control over the input and some of it is, as expected, a bit messy.
One problem I'm encountering frequently is that urlparse is very strict (and possibly even buggy?) when it comes to parsing and joining URLs that have double-slashes in the path part, for example:
import urlparse  # Python 2 module; in Python 3 the same functions live in urllib.parse
testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl, urlparse.urlparse(testUrl).path)
Instead of the expected result http://www.example.com//path (or even better, with a normalized single slash), I end up with http://path.
BTW the reason I'm running such code is because it's the only way I found so far to strip the query / fragment part off of URLs. Maybe there is a better way to do it, but I couldn't find one.
Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regex?
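If you only want to get the URL without the query part, I would skip the urlparse module and just do:
testUrl.split('?', 1)
The URL will be at index 0 of the returned list and, if there was a query, the query will be at index 1.
A literal '?' can legally appear again inside the query string, so the split is limited to the first '?'; everything before it is the URL without its query. Note that this does not remove a fragment ('#...') when the URL has no query part.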
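If you would rather stay with the standard library, here is a minimal sketch of another way to drop the query and fragment. The reason the original snippet returns http://path is that urljoin treats a reference starting with '//' as a network-path reference, so 'path' is read as the host. Splitting the URL and reassembling it without the last two components avoids urljoin entirely. The helper name strip_query_and_fragment and its collapse_slashes flag are made up for this example, and collapsing repeated slashes is an assumption about what "normalized" should mean for your input.

```python
import re

try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2


def strip_query_and_fragment(url, collapse_slashes=False):
    """Return url without its query and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    path = parts.path
    if collapse_slashes:
        # Assumption: runs of '/' in the path can be safely reduced to one.
        path = re.sub(r'/{2,}', '/', path)
    # Rebuild from scheme, host and path only; query and fragment are left empty.
    return urlunsplit((parts.scheme, parts.netloc, path, '', ''))


testUrl = 'http://www.example.com//path?foo=bar'
print(strip_query_and_fragment(testUrl))
print(strip_query_and_fragment(testUrl, collapse_slashes=True))
```

Because the scheme and host are passed through unchanged, the double slash in the path can no longer be mistaken for the start of a host name.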