I need a regular expression in Java that I can use to retrieve the domain.tld part from any URL. So https://foo.com/bar, http://www.foo.com#bar, and http://bar.foo.com should all return foo.com.
I wrote this regex, but it's matching the whole URL:
Pattern.compile("[.]?.*[.x][a-z]{2,3}");
I'm not sure I'm matching the "." character right. I tried "\." but I get an error from NetBeans.
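On the dot question: in Java source a backslash inside a string literal has to be doubled, so a literal dot in a regex is written "\\." (a single backslash before the dot is rejected by the compiler as an illegal escape). Here is a small sketch of the two common spellings next to the pattern from the question. The class name DotEscapeDemo and the helper firstMatch are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotEscapeDemo {
    // In Java source, "\\." is the regex \. and matches exactly one literal dot.
    static final Pattern ESCAPED_DOT = Pattern.compile("\\.");

    // "[.]" is an equivalent spelling: inside a character class the dot is literal.
    static final Pattern CLASS_DOT = Pattern.compile("[.]");

    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile("[.]?.*[.x][a-z]{2,3}");

    // Hypothetical helper: returns the first substring matched by p, or null if none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }
}
```

With the question's pattern, firstMatch(ORIGINAL, "https://foo.com/bar") gives "https://foo.com". The greedy .* starts at the beginning of the input and only backtracks as far as the last dot followed by lowercase letters, which is why the scheme and any subdomains end up in the match. Note also that [.x] matches either a dot or the letter x.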
Update:
The TLD is not limited to 2 or 3 characters, and http://www.foo.co.uk/bar should return foo.co.uk.
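For the foo.co.uk case, a regex on its own cannot tell that co.uk is a suffix while foo.com is not, so some list of suffixes is needed. Below is a minimal sketch, not a definitive implementation: it lets java.net.URI pull out the host and then checks the last two labels against a tiny hard-coded set. The class name, the method name, and the three suffixes in MULTI_LABEL_SUFFIXES are made up for illustration; real code would need a complete list such as the Public Suffix List.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class RegistrableDomainSketch {
    // Hypothetical, deliberately incomplete set of two-label suffixes (illustration only).
    private static final Set<String> MULTI_LABEL_SUFFIXES =
            new HashSet<>(Arrays.asList("co.uk", "org.uk", "com.au"));

    // Returns the domain.tld part of the URL's host, or null if no host can be parsed.
    static String extract(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null; // not parseable as a URI at all
        }
        if (host == null) {
            return null; // e.g. a relative reference with no authority part
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        int n = labels.length;
        if (n < 2) {
            return host; // single-label host such as "localhost"
        }
        String lastTwo = labels[n - 2] + "." + labels[n - 1];
        if (n >= 3 && MULTI_LABEL_SUFFIXES.contains(lastTwo)) {
            // The suffix itself has two labels, so keep one more label in front of it.
            return labels[n - 3] + "." + lastTwo;
        }
        return lastTwo;
    }
}
```

Under these assumptions, the three URLs from the question all come back as foo.com, and http://www.foo.co.uk/bar comes back as foo.co.uk.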
This is harder than you might imagine. In your example, https://foo.com/bar is followed by a comma, and a comma is itself a valid URL character, so it is not obvious where the URL ends. Here is a great post about some of the troubles:
https://blog.codinghorror.com/the-problem-with-urls/
https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])
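is a good starting point. Note that the capture group holds everything after the scheme, path included, so the host still has to be cut out of it.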
Some listings from "Mastering Regular Expressions" on this topic:
http://regex.info/listing.cgi?ed=3&p=207
@sjobe
>>> import re
>>> pattern = r'https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])'
>>> url = re.compile(pattern)
>>> url.match('http://news.google.com/').groups()
('news.google.com/',)
>>> url.match('not a url').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> url.match('http://google.com/').groups()
('google.com/',)
>>> url.match('http://google.com').groups()
('google.com',)
Sorry, the example is in Python rather than Java because it's more concise. Java requires some extra escaping of the regex.
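For reference, here is a rough Java port of the Python session, offered as a sketch only. This particular pattern happens to contain no backslashes, so it can go into a Java string literal unchanged; a pattern using something like \. or \d would need each backslash doubled. Python's match() anchors only at the start of the input, which corresponds to Matcher.lookingAt() in Java. The class name UrlPatternPort and the method afterScheme are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlPatternPort {
    // Same pattern as in the Python session; no extra escaping is needed for this one.
    private static final Pattern URL = Pattern.compile(
            "https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])");

    // Returns capture group 1 (everything after the scheme),
    // or null when the input does not start with a matching URL.
    static String afterScheme(String input) {
        Matcher m = URL.matcher(input);
        // lookingAt() anchors at the start only, like Python's re.match().
        return m.lookingAt() ? m.group(1) : null;
    }
}
```

Returning null for a non-match avoids the equivalent of the AttributeError in the Python transcript, though the caller then has to check for null before using the result.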