
Combining a url with urlunparse

I'm writing something to 'clean' a URL. In this case, all I'm trying to do is add a fake scheme, since urlopen won't work without one. However, if I test this with www.python.org, it returns http:///www.python.org. Does anyone know why there is an extra /, and is there a way to return the URL without it?

def FixScheme(website):

   from urlparse import urlparse, urlunparse

   scheme, netloc, path, params, query, fragment = urlparse(website)

   if scheme == '':
       return urlunparse(('http', netloc, path, params, query, fragment))
   else:
       return website
Ben asked Sep 26 '10 14:09



2 Answers

The problem is that in parsing the very incomplete URL www.python.org, the string you give is actually taken as the path component of the URL, with the netloc (network location) being empty as well as the scheme. For defaulting the scheme you can actually pass a second parameter, scheme, to urlparse (simplifying your logic), but that doesn't help with the "empty netloc" problem. So you need some logic for that case, e.g.

if not netloc:
    netloc, path = path, ''
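
Put together, the whole function might look roughly like the sketch below. It assumes Python 3, where the module is urllib.parse rather than urlparse, and the name fix_scheme and the default_scheme parameter are made up for illustration. It only applies the swap shown above, so it is not a general URL validator.

```python
from urllib.parse import urlparse, urlunparse


def fix_scheme(website, default_scheme="http"):
    """Return website with a scheme, treating a bare host as the netloc."""
    # The second argument supplies the scheme when the input has none.
    scheme, netloc, path, params, query, fragment = urlparse(website, default_scheme)

    if not netloc:
        # A string such as "www.python.org" is parsed as a path, so move it
        # into the netloc slot before reassembling the URL.
        netloc, path = path, ""

    return urlunparse((scheme, netloc, path, params, query, fragment))


if __name__ == "__main__":
    print(fix_scheme("www.python.org"))
    print(fix_scheme("http://www.python.org/"))
```

URLs that already carry a scheme and a netloc pass through the parse and unparse round trip unchanged.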
Alex Martelli answered Oct 07 '22 02:10


It's because urlparse is interpreting "www.python.org" not as the hostname (netloc), but as the path, just as a browser would if it encountered that string in an href attribute. Then urlunparse seems to interpret scheme "http" specially. If you put in "x" as the scheme, you'll get "x:www.python.org".
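
To see the parsing behaviour directly, a short sketch like the following can help. It assumes Python 3 (urllib.parse), and it inspects only the parse result, because the exact string urlunparse produces for an empty netloc may depend on the Python version.

```python
from urllib.parse import urlparse

# A bare host name has no "//" marker, so it is parsed as a relative path.
bare = urlparse("www.python.org")
print(repr(bare.scheme), repr(bare.netloc), repr(bare.path))

# With a leading "//" the same text is recognised as the network location.
with_slashes = urlparse("//www.python.org")
print(repr(with_slashes.scheme), repr(with_slashes.netloc), repr(with_slashes.path))
```

The first call leaves netloc empty and puts everything into path, which is the root of the extra slash.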

I don't know what range of inputs you're dealing with, but it looks like you might not want urlparse and urlunparse.
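
If the inputs are only ever bare host names or full URLs, one possible reading of this suggestion is a plain string check. This is a rough sketch: the function name add_default_scheme is hypothetical, and testing for "://" is an assumption that will not suit schemes written without slashes, such as mailto:.

```python
def add_default_scheme(website, default_scheme="http"):
    """Prefix website with default_scheme unless it already names a scheme."""
    if "://" in website:
        # Already looks like scheme://..., so leave it alone.
        return website
    # Drop any leading slashes (e.g. "//host") before adding the scheme.
    return default_scheme + "://" + website.lstrip("/")


if __name__ == "__main__":
    print(add_default_scheme("www.python.org"))
```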

Ned Batchelder answered Oct 07 '22 03:10