I want to check whether a URL is valid, before I open it to read data.
I was using the function urlparse from the urlparse package:
if urlparse.urlparse(url).netloc:
    # do something like: open and read using urllib2
However, I noticed that some valid URLs are treated as broken, for example:
url = "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"
This URL is valid (I can open it using my browser).
Is there a better way to check if the URL is valid?
In JavaScript, you can use the URL constructor to check whether a string is a valid URL. The constructor ( new URL(url) ) returns a newly created URL object defined by the given parameters. A JavaScript TypeError exception is thrown if the given URL is not valid.
Using the validators package: the URL validation function is available in the root of the module and returns True if the string is a valid URL; otherwise it returns an instance of ValidationFailure, which is a bit weird but not a deal breaker.
The urlparse module contains functions to process URLs, and to convert between URLs and platform-specific filenames. Example 7-16 demonstrates. A common use is to split an HTTP URL into host and path components (an HTTP request involves asking the host to return data identified by the path), as shown in Example 7-17.
urllib.parse: Parse URLs into components. This module defines a standard interface to break Uniform Resource Locator (URL) strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.”
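To see why the URL from the question fails the netloc test, it can help to look at the components directly. The following is a minimal sketch assuming Python 3, where the parser lives in urllib.parse (the question uses the older Python 2 urlparse module); the helper name split_url is made up for illustration.

```python
from urllib.parse import urlparse


def split_url(url):
    """Return (scheme, netloc, path) as parsed by urllib.parse.urlparse."""
    parts = urlparse(url)
    return parts.scheme, parts.netloc, parts.path


# URL from the question: with no scheme and no leading "//",
# the parser treats the whole string as a path, so netloc is empty.
no_scheme = split_url(
    "upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"
)

# The same URL with an explicit scheme splits into host and path.
with_scheme = split_url(
    "http://upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"
)

print(no_scheme)
print(with_scheme)
```

So the empty netloc is not a sign that the URL is broken, only that the string lacks the scheme a browser would silently add.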
TL;DR: You can't actually. Every answer given already misses 1 or more cases.
Case 1: all([result.scheme, result.netloc, result.path]) seems to work for this case.
Case 2: all([result.scheme, result.netloc, result.path]) seems to catch this case.
Case 3: all([result.scheme, result.netloc, result.path]) works fine.
Case 4: all([result.scheme, result.netloc, result.path]) seems to give a false negative.
So from the above cases you see that the one that comes closest to a solution is all([result.scheme, result.netloc, result.path]). But this works only in cases where the URL contains a path (even if that is the / path).
Even if you try to enforce a path (i.e. urlparse(urljoin(your_url, "/"))), you will still get a false positive in case 2.
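The two behaviours described here can be reproduced with a short sketch, assuming Python 3 (urllib.parse). The function names naive_check and forced_path_check and the example.com strings are illustrative assumptions, not the original test cases; in particular, the dotless host below is only a guess at the kind of input that slips through once a path is forced.

```python
from urllib.parse import urljoin, urlparse


def naive_check(url):
    """The all([scheme, netloc, path]) check discussed in the text."""
    result = urlparse(url)
    return all([result.scheme, result.netloc, result.path])


def forced_path_check(url):
    """Same check, but with the path forced to "/" via urljoin first."""
    result = urlparse(urljoin(url, "/"))
    return all([result.scheme, result.netloc, result.path])


# A URL with scheme and host but no path fails the naive check...
print(naive_check("http://example.com"))         # False
# ...while the same URL with a "/" path passes.
print(naive_check("http://example.com/"))        # True

# Forcing a path fixes the false negative above,
print(forced_path_check("http://example.com"))   # True
# but a host without any dot now passes as well.
print(forced_path_check("http://example"))       # True
```

Note that urljoin(url, "/") replaces any existing path with "/", so this variant only says something about the scheme and host.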
Maybe something more complicated like:
final_url = urlparse(urljoin(your_url, "/"))
is_correct = (all([final_url.scheme, final_url.netloc, final_url.path])
              and len(final_url.netloc.split(".")) > 1)
Maybe you also want to skip scheme checking and assume http if there is no scheme. But even this will only get you up to a point. Although it covers the above cases, it doesn't fully cover cases where a URL contains an IP address instead of a hostname. For such cases you will have to validate that the IP address is well formed. And there are more scenarios as well. See https://en.wikipedia.org/wiki/URL to think of even more cases.
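One way to combine these ideas is sketched below, assuming Python 3 and only the standard library (urllib.parse and ipaddress). The function name looks_like_url and its default_scheme parameter are hypothetical, and the rules it applies (assume http when no scheme is present, accept well-formed IP addresses, otherwise require a dotted hostname) are just one possible reading of the suggestions above. It is still a heuristic and, as noted, will miss some cases.

```python
import ipaddress
from urllib.parse import urlparse


def looks_like_url(url, default_scheme="http"):
    """Heuristic URL check; a sketch, not a full validator."""
    # Assume a scheme when none is given, as a browser would.
    if "://" not in url:
        url = f"{default_scheme}://{url}"
    try:
        parts = urlparse(url)
        host = parts.hostname  # host without port or credentials, or None
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. unbalanced brackets.
        return False
    if not parts.scheme or not host:
        return False

    # A host that parses as an IPv4 or IPv6 address is accepted as-is.
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        pass

    labels = host.split(".")
    # A host ending in a purely numeric label was probably meant as an
    # IPv4 address; it failed the check above, so reject it.
    if labels[-1].isdigit():
        return False
    # Otherwise require at least two non-empty dot-separated labels.
    return len(labels) > 1 and all(labels)


print(looks_like_url("upload.wikimedia.org/math/8/8/d/88d27d47cea8c88adf93b1881eda318d.png"))
print(looks_like_url("http://example"))
print(looks_like_url("http://192.168.0.1/x"))
print(looks_like_url("http://999.1.1.1"))
```

The check is purely syntactic: it never contacts the host, so a string that passes may still point at nothing. It also rejects hostnames with a trailing dot, which a stricter implementation might want to allow.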