
How can I check whether a URL is valid using `urlparse`?

I want to check whether a URL is valid, before I open it to read data.

I was using the function urlparse from the urlparse package:

import urlparse  # Python 2 module; in Python 3 use urllib.parse
if urlparse.urlparse(url).netloc:
    # netloc is present: open and read the URL using urllib2
    pass

However, I noticed that some valid URLs are treated as broken, for example:

url = "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"

This URL is valid (I can open it using my browser).

Is there a better way to check if the URL is valid?

Ziva asked Aug 12 '14 08:08



1 Answer

TL;DR: You can't, really. Every answer given so far misses one or more cases.

  1. String is google.com (invalid since there is no scheme, even though a browser assumes http by default). Urlparse will return an empty scheme and netloc, with the whole string in path. So all([result.scheme, result.netloc, result.path]) correctly rejects this case
  2. String is http://google (invalid since .com is missing). Urlparse will be missing only path. Again all([result.scheme, result.netloc, result.path]) rejects this case
  3. String is http://google.com/ (correct). Urlparse will populate scheme, netloc and path. So for this case all([result.scheme, result.netloc, result.path]) works fine
  4. String is http://google.com (correct). Urlparse will be missing only path. So for this case all([result.scheme, result.netloc, result.path]) gives a false negative
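
The four cases above can be reproduced directly. This is a minimal sketch assuming Python 3, where urlparse lives in urllib.parse; the helper name naive_check is made up for illustration and is not part of any library.

```python
from urllib.parse import urlparse


def naive_check(url):
    """Hypothetical helper: the all([...]) test discussed above."""
    result = urlparse(url)
    return all([result.scheme, result.netloc, result.path])


# Print the parsed components and the verdict for each of the four cases.
for candidate in ["google.com", "http://google",
                  "http://google.com/", "http://google.com"]:
    parts = urlparse(candidate)
    print(candidate, (parts.scheme, parts.netloc, parts.path),
          naive_check(candidate))
```

Only http://google.com/ passes; http://google.com is rejected purely because its path is empty.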

So from the above cases you see that the one that comes closest to a solution is all([result.scheme, result.netloc, result.path]). But this works only in cases where the url contains a path (even if that is the / path).

Even if you try to enforce a path (i.e. urlparse(urljoin(your_url, "/"))), you will still get a false positive in case 2.
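
To see the effect of forcing a path, here is a small sketch, again assuming Python 3's urllib.parse. The function name check_with_forced_path is hypothetical. Note that urljoin(url, "/") replaces any existing path and query with "/", which is acceptable here only because the result is used for validation and not for fetching.

```python
from urllib.parse import urljoin, urlparse


def check_with_forced_path(url):
    """Hypothetical helper: force a "/" path, then apply the all([...]) test."""
    result = urlparse(urljoin(url, "/"))
    return all([result.scheme, result.netloc, result.path])


print(check_with_forced_path("http://google.com"))  # case 4 is now accepted
print(check_with_forced_path("http://google"))      # case 2 is accepted too
print(check_with_forced_path("google.com"))         # case 1 is still rejected
```

Case 4 is fixed, but case 2 now slips through, which is the false positive mentioned above.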

Maybe something more complicated, like:

from urllib.parse import urljoin, urlparse  # Python 2: from urlparse import urljoin, urlparse

final_url = urlparse(urljoin(your_url, "/"))
is_correct = (all([final_url.scheme, final_url.netloc, final_url.path])
              and len(final_url.netloc.split(".")) > 1)

Maybe you also want to skip scheme checking and assume http if there is no scheme. But even this will only get you so far. Although it covers the above cases, it doesn't fully cover cases where a URL contains an IP address instead of a hostname. For such cases you will have to validate that the IP address is well formed. And there are more scenarios as well. See https://en.wikipedia.org/wiki/URL to think of even more cases.
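
One way to combine these ideas is sketched below. It assumes Python 3, defaults to http when no scheme is given, and uses the standard ipaddress module to check IP hosts. The function name looks_like_url and its default_scheme parameter are made up for illustration. It is still only a heuristic for http-style URLs and will misjudge other inputs (for example, single-label hosts such as localhost are rejected).

```python
import ipaddress
from urllib.parse import urlparse


def looks_like_url(candidate, default_scheme="http"):
    """Hypothetical heuristic; not a complete URL validator."""
    # Assume a default scheme when none is given, as a browser would.
    if "://" not in candidate:
        candidate = default_scheme + "://" + candidate
    try:
        host = urlparse(candidate).hostname
    except ValueError:
        # urlparse raises ValueError for some malformed hosts, e.g. "http://[::1".
        return False
    if not host:
        return False
    try:
        # Accept the host if it is a well-formed IPv4 or IPv6 address.
        ipaddress.ip_address(host)
        return True
    except ValueError:
        pass
    labels = host.split(".")
    # All-numeric hosts such as 999.1.1.1 are malformed IPs, not hostnames.
    if all(label.isdigit() for label in labels):
        return False
    # Otherwise require a dotted hostname with no empty labels.
    return len(labels) > 1 and all(labels)
```

With this sketch, the scheme-less Wikimedia URL from the question is accepted, while http://google and http://999.1.1.1/ are rejected.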

John Paraskevopoulos answered Oct 27 '22 10:10