A string like 'www.test.com' is good. A string like 'www.888.com' is good. A string like 'stackoverflow.com' is good. A string like 'GOoGle.Com' is good.
Why? Because those are valid URLs. It does not necessarily matter whether they have been registered or not.
Bad strings are:
'goog*d\x' 'manydots...com'
Why? Because you can't register those URLs.
If I have a string in Java that is supposed to be a good URL, what's the best way to validate it?
Thanks a lot.
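We can use the java.net.URL class to validate a URL. The idea is to create a URL object from the given String representation; the constructor throws a MalformedURLException if the string cannot be parsed, for example when the scheme is missing or unknown.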
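A minimal sketch of that idea follows. The class and method names (UrlSyntaxCheck, isParsableUrl) are hypothetical and chosen only for illustration. The extra toURI() call is an assumption added here for stricter syntax checking; the URL constructors are also deprecated as of Java 20, so newer code may prefer going through java.net.URI.

```java
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;

final class UrlSyntaxCheck {
    // Hypothetical helper: true when the string parses as an absolute URL.
    static boolean isParsableUrl(String candidate) {
        try {
            // new URL(...) rejects a missing or unknown scheme;
            // toURI() (an added assumption) applies stricter URI syntax rules.
            new URL(candidate).toURI();
            return true;
        } catch (MalformedURLException | URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isParsableUrl("http://www.test.com"));
        System.out.println(isParsableUrl("www.test.com")); // no scheme
    }
}
```

Note that this only checks syntax. A bare hostname such as 'www.test.com' is rejected because it has no scheme, and a syntactically valid URL may still point at a host that could never be registered.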
A URL is composed of a limited set of characters belonging to the US-ASCII character set. These characters include digits (0-9), letters (A-Z, a-z), and a few special characters ("-", ".", "_", "~").
In JavaScript, you can use the URL constructor to check whether a string is a valid URL. new URL(url) returns a newly created URL object for the given string, and a TypeError is thrown if the URL is not valid.
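Unsafe characters, as listed in the older RFC 1738, include { , } , | , \ , ^ , ~ , [ , ] , and ` . All unsafe characters must always be encoded within a URL. (The later RFC 3986 treats ~ as an ordinary unreserved character.)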
Use UrlValidator from the Apache Commons Validator library. Binary package: http://www.mirrorservice.org/sites/ftp.apache.org/commons/validator/binaries/commons-validator-1.3.1.zip (the zip contains the .jar files).
Example of usage (Construct a UrlValidator with valid schemes of "http", and "https"):
// Only URLs using one of these schemes are accepted.
String[] schemes = {"http", "https"};
UrlValidator urlValidator = new UrlValidator(schemes);
if (urlValidator.isValid("ftp://foo.bar.com/")) {
    System.out.println("url is valid");
} else {
    System.out.println("url is invalid");
}
prints "url is invalid"
If the default constructor is used instead:
UrlValidator urlValidator = new UrlValidator();
if (urlValidator.isValid("ftp://foo.bar.com/")) {
    System.out.println("url is valid");
} else {
    System.out.println("url is invalid");
}
prints out "url is valid"
Those examples are hostnames. They're not valid URLs in themselves.
Hostnames are made of dot-separated ‘labels’. Each label must be 1 to 63 characters of letters, digits and hyphens, but a hyphen must not be the first or last character. It is optional to follow the whole hostname with another dot.
You can match this with a pattern like (assuming case-insensitive):
([a-z0-9]|[a-z0-9][a-z0-9\-]{0,61}[a-z0-9])(\.([a-z0-9]|[a-z0-9][a-z0-9\-]{0,61}[a-z0-9]))*\.?
However, this also matches strings like 1.2.3.4, which, although they technically could be host/domain names, will actually act as direct IP addresses. You may want to allow that. If you do, you may also want to allow IPv6 addresses, which are colon-separated hex; when embedded in a URL, they also have square brackets around them.
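As a rough sketch, the label rules above could be applied in Java with java.util.regex. The class and method names (HostnameCheck, isHostname) are hypothetical. The pattern groups the dot together with each following label so that the 63-character limit applies to every label, and it assumes plain ASCII hostnames only; it does not enforce an overall length limit.

```java
import java.util.regex.Pattern;

final class HostnameCheck {
    // One label: 1 to 63 letters, digits or hyphens, no hyphen at either end.
    private static final String LABEL =
            "(?:[a-z0-9]|[a-z0-9][a-z0-9-]{0,61}[a-z0-9])";

    // Labels separated by dots, with an optional trailing dot.
    private static final Pattern HOSTNAME = Pattern.compile(
            LABEL + "(?:\\." + LABEL + ")*\\.?", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true when the whole string is a syntactically valid hostname.
    static boolean isHostname(String candidate) {
        return candidate != null && HOSTNAME.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"www.test.com", "GOoGle.Com", "goog*d\\x", "manydots...com", "1.2.3.4"};
        for (String s : samples) {
            System.out.println(s + " -> " + isHostname(s));
        }
    }
}
```

With this sketch the four good strings from the question match, the two bad ones do not, and 1.2.3.4 still matches, so IP-address handling has to be decided separately.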
And then of course there's IDNA. Nowadays, 例え.テスト is a valid IDNA domain name, corresponding to xn--r8jz45g.xn--zckzah. If you want to allow those you'll need some Unicode support.
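One possible way to get that Unicode support in Java is java.net.IDN, which converts a Unicode domain name to its ASCII form so that the ASCII-only hostname rules can then be applied. The class and method names below (IdnExample, toAsciiHost) are hypothetical, and java.net.IDN follows the older IDNA2003 rules, so treat this as a sketch rather than a complete IDNA solution.

```java
import java.net.IDN;

final class IdnExample {
    // Hypothetical helper: converts a Unicode hostname to its ASCII (Punycode) form.
    // IDN.toASCII throws IllegalArgumentException for input it cannot convert.
    static String toAsciiHost(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    public static void main(String[] args) {
        // The escapes below spell the domain name 例え.テスト from the text above.
        String host = "\u4F8B\u3048.\u30C6\u30B9\u30C8";
        System.out.println(toAsciiHost(host));
    }
}
```

The ASCII result can then be checked against the same label rules as any other hostname.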
Summary: it's quite a bit more difficult than you might think. And that's just hostnames. ‘Validating’ a whole URL is even more work. A simple regex isn't going to hack it. Use a pre-existing library.