URL parsing in Python - normalizing double-slash in paths

Question

I am working on an app which needs to parse URLs (mostly HTTP URLs) in HTML pages - I have no control over the input and some of it is, as expected, a bit messy.

One problem I'm encountering frequently is that urlparse is very strict (and possibly even buggy?) when it comes to parsing and joining URLs that have double-slashes in the path part, for example:

testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl, 
                 urlparse.urlparse(testUrl).path)

Instead of the expected result http://www.example.com//path (or even better, with a normalized single slash), I end up with http://path.

BTW the reason I'm running such code is because it's the only way I found so far to strip the query / fragment part off of URLs. Maybe there is a better way to do it, but I couldn't find one.

Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regex?

Eric Fortin · Accepted Answer

If you only want to get the url without the query part, I would skip the urlparse module and just do:

testUrl.rsplit('?')

The url will be at index 0 of the list returned and the query at index 1.

It is not possible to have two '?' in an url so it should work for all urls.

URL parsing in Python - normalizing double-slash in paths

Tags:

python

urlparse

shevron

1 Answers

Eric Fortin

Recent Activity

Donate For Us

URL parsing in Python - normalizing double-slash in paths

Tags:

python

urlparse

shevron

1 Answers

Eric Fortin

Related questions

Recent Activity

Donate For Us