
URL parsing in Python - normalizing double-slash in paths

I am working on an app which needs to parse URLs (mostly HTTP URLs) in HTML pages - I have no control over the input and some of it is, as expected, a bit messy.

One problem I'm encountering frequently is that urlparse is very strict (and possibly even buggy?) when it comes to parsing and joining URLs that have double-slashes in the path part, for example:

import urlparse  # Python 2 module; in Python 3 it is urllib.parse

testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl, urlparse.urlparse(testUrl).path)

Instead of the expected result http://www.example.com//path (or even better, with a normalized single slash), I end up with http://path.

BTW, the reason I'm running such code is that it's the only way I've found so far to strip the query and fragment parts off URLs. Maybe there is a better way to do it, but I couldn't find one.

Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regex?

shevron asked Jan 19 '12 12:01



1 Answer

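If you only want to get the URL without the query part, I would skip the urlparse module and just do:

testUrl.split('?', 1)

The URL will be at index 0 of the returned list and the query, if there is one, at index 1.

The first '?' in a URL starts the query, but further '?' characters may appear inside the query itself, so the split is limited to the first one. Note that this does not strip a fragment from a URL that has no query.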
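If you would rather stay with the standard library parser, here is a minimal sketch, assuming Python 3 (where the module is urllib.parse rather than the Python 2 urlparse used in the question). The helper name strip_query_and_fragment is made up for illustration. It drops the query and fragment and collapses repeated slashes in the path, without going through urljoin at all:

```python
import re
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Hypothetical helper: drop ?query and #fragment, collapse '//' runs in the path."""
    parts = urlsplit(url)
    # Collapse any run of two or more slashes in the path into a single slash.
    path = re.sub(r'/{2,}', '/', parts.path)
    # Rebuild the URL with an empty query and an empty fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, '', ''))


print(strip_query_and_fragment('http://www.example.com//path?foo=bar'))
```

As for the http://path result in the question: a reference that starts with '//' is parsed as a network-path reference, so when '//path' is passed as the second argument to urljoin, 'path' is read as the host rather than as part of the path. Rebuilding from the split components, as above, avoids that reinterpretation.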

Eric Fortin answered Nov 04 '22 02:11
