
Regular expression to retrieve domain.tld




I need a regular expression in Java that I can use to retrieve the domain.tld part from any URL. So https://foo.com/bar, http://www.foo.com#bar, and http://bar.foo.com should all return foo.com.

I wrote this regex, but it's matching the whole url


I'm not sure I'm matching the "." character correctly. I tried "\." but I get an error from NetBeans.


The tld is not limited to 2 or 3 characters, and http://www.foo.co.uk/bar should return foo.co.uk.

sjobe asked Nov 27 '22 08:11


1 Answer

This is harder than you might imagine. Your example "https://foo.com/bar," is followed by a comma, and a comma is a valid URL character, so a pattern cannot easily tell where the URL ends. Here is a great post about some of the troubles:



Is a good starting point

Some listings from "Mastering Regular Expressions" on this topic:



>>> import re
>>> pattern = r'https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])'
>>> url = re.compile(pattern)
>>> url.match('http://news.google.com/').groups()
('news.google.com/',)
>>> url.match('not a url').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> url.match('http://google.com/').groups()
('google.com/',)
>>> url.match('http://google.com').groups()
('google.com',)

Sorry the example is in Python, not Java; it's briefer. Java requires some extra escaping of the regex.
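
If it helps, here is a rough sketch (Python 3 again) of going from a URL to just domain.tld. It is only an illustration under assumptions: the function name registrable_domain and the tiny MULTI_LABEL_SUFFIXES set are made up for this example, and a real solution would need the full Public Suffix List rather than three hard-coded entries. It also ignores userinfo (user@host) and IP addresses.

```python
import re

# Hypothetical, deliberately incomplete set of two-label suffixes.
# A real implementation would load the Public Suffix List instead.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}

# Capture the host: everything after the scheme up to the first
# '/', '?', '#' or ':' (port separator).
HOST_RE = re.compile(r'^https?://([^/?#:]+)', re.IGNORECASE)


def registrable_domain(url):
    """Return the domain.tld part of url, or None if no host is found."""
    match = HOST_RE.match(url)
    if match is None:
        return None
    labels = match.group(1).lower().split(".")
    if len(labels) < 2:
        # Single-label host such as "localhost": nothing to trim.
        return labels[0]
    last_two = ".".join(labels[-2:])
    # If the last two labels form a known suffix, keep one more label.
    if last_two in MULTI_LABEL_SUFFIXES and len(labels) >= 3:
        return ".".join(labels[-3:])
    return last_two
```

With the URLs from the question, registrable_domain returns foo.com for https://foo.com/bar, http://www.foo.com#bar and http://bar.foo.com, and foo.co.uk for http://www.foo.co.uk/bar. The regex only finds the host; deciding how many labels to keep is a lookup problem, which is why a regex alone cannot handle suffixes like co.uk.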

jsamsa answered Dec 05 '22 18:12
