
grep valid domain regex [duplicate]

Tags: regex, grep, bash, dns

I'm trying to write a regex for grep that matches only valid domains.

My version works pretty well, but it also matches the following invalid domain:

@subdom..dom.ext

Here is my regex:

echo "@dom.ext" | grep "^@[[:alnum:]]\+[[:alnum:]\-\.]\+[[:alnum:]]\+\.[[:alpha:]]\+\$"

I'm working with bash so I escaped special characters.

Samples that should match:

@subdom.dom.ext
@subsubdom.subdom.dom.ext
@subsub-dom.sub-dom.ext

Thanks for help

asked Jan 16 '14 at 20:01 by Arka



1 Answer

A truly complete solution requires more work, but here's an approximation that may work well enough (note that a @ prefix is assumed and the input string is expected to start with it):

^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$

You can use this with grep -E (or egrep, which recent GNU grep releases flag as obsolescent), and also with [[ ... =~ ... ]], bash's regex-matching operator.
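As a minimal sketch of the grep -E usage, the following filters a list of candidate names and prints only those that match. The helper name filter_domains is hypothetical and chosen purely for illustration.

```shell
# Regex from the answer, single-quoted so the shell passes it to grep unchanged.
re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'

# filter_domains (hypothetical helper): reads names on stdin, one per line,
# and prints only the lines that match the regex.
filter_domains() {
  grep -E "$re"
}

# Example: only the names that match the regex are printed.
printf '%s\n' '@subdom..dom.ext' '@dom.ext' '@subsub-dom.sub-dom.ext' | filter_domains
```

Because grep exits with a non-zero status when nothing matches, the same pipeline can also be used as a condition in an if statement.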

The regex makes the following assumptions, some of which are more permissive and some more restrictive than actual DNS name constraints:

  • Only ASCII (non-foreign) letters are allowed - see below for Internationalized Domain Name (IDN) considerations; also, the Punycode (ASCII-compatible) forms of IDNs - e.g., xn--bcher-kva.ch for bücher.ch - are not matched - see below.

  • There's no limit on the number of nested subdomains.

  • There's no limit on the length of any label (name component), and no limit on the overall length of the name (the actual DNS limits are 63 characters per label and 253 characters for the full name in its textual form).

  • The TLD (last component) is composed of letters only and has a length of at least 2.

  • Both subdomain and domain names must start with a letter; subdomains are allowed to be single-letter.
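If the length limits matter for your use case, they can be layered on top of the regex. The following is only a sketch: it assumes ASCII names, single-line input, and the commonly cited limits of 63 characters per label and 253 characters for the whole name. The helper name is_valid_domain is hypothetical.

```shell
re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'

# is_valid_domain (hypothetical helper): succeeds if $1 matches the regex and
# also respects the assumed length limits. Assumes $1 is a single line.
is_valid_domain() {
  printf '%s\n' "$1" | grep -Eq "$re" || return 1
  name=${1#@}                          # strip the assumed @ prefix
  [ "${#name}" -le 253 ] || return 1   # overall length limit
  # Split the name on dots; the regex guarantees there are no glob characters.
  old_ifs=$IFS
  IFS=.
  set -- $name
  IFS=$old_ifs
  for label in "$@"; do
    [ "${#label}" -le 63 ] || return 1 # per-label length limit
  done
  return 0
}

is_valid_domain '@subsub-dom.sub-dom.ext' && echo YES || echo NO
```

The regex check runs first so that the splitting step only ever sees names made of letters, digits, hyphens, and dots.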

Here's a quick test:

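re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'
for d in @subdom..dom.ext @dom.ext @subdom.dom.ext @subsubdom.subdom.dom.ext @subsub-dom.sub-dom.ext @x.org; do
  # $re is left unquoted so that bash treats it as a regex, not a literal string.
  # Prints NO for the first name and YES for the rest.
  [[ $d =~ $re ]] && echo YES || echo NO
done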

Support for Internationalized Domain Names (IDN) with literal Unicode characters - again, a complete solution requires more work:

A simple improvement to also match IDNs is to replace [a-zA-Z] with [[:alpha:]] and [a-zA-Z0-9] with [[:alnum:]] in the above regex; i.e.:

^@(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$

Caveats:

  • No attempt is made to recognize Punycode-encoded versions of IDNs, which use an ASCII-based encoding with prefix xn--, and which would require decoding afterwards.

  • As Patrick Mevzek points out, the above can yield both false negatives and false positives (using his examples):

    • False positive: an invalid Punycode-encoded name such as ab--whatever
    • False positive: Invalid cross-language names; e.g., cαfe.fr, which uses a Greek letter in a French domain name - a rule that is impossible to enforce via a regex alone.
    • False negative: emoji-based names such as 💄.ws (xn--jr8h.ws)
    • False negative: பரிட்சை is a valid TLD in the IANA root zone today, but does not match [[:alpha:]]{2,}$
    • ... and many more
  • Not all Unix-like platforms fully support all Unicode letters when matching against [[:alpha:]] or [[:alnum:]]. For instance, using UTF-8-based locales, OS X 10.9.1 apparently only matches Latin diacritics (e.g., ü, á) and Cyrillic characters (in addition to ASCII), whereas Linux 3.2 laudably appears to cover all scripts, including Asian and Arabic ones.

  • I'm unclear on whether names in right-to-left writing scripts are properly matched.

  • For the sake of completeness: even though the regex above makes no attempt to enforce length limits, attempting to do so with IDNs would be much more complex, as the length limits apply to the ASCII encoding of the name (via Punycode), not the original.
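To at least let Punycode-encoded (xn--) names through, one option is to relax the hyphen rule to the letters-digits-hyphen convention. This is a sketch under stated assumptions rather than a drop-in replacement: it does not decode or validate the Punycode itself, it still accepts invalid names such as ab--whatever.com, and unlike the regex above it allows digits at the start of a label and in the TLD. The names label, ldh_re, and matches_ldh are hypothetical.

```shell
# Assumption: each label is 1 to 63 characters, starts and ends with a letter
# or digit, and may contain hyphens in between (including the "--" in xn--).
label='[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?'
ldh_re="^@($label\.)+$label\$"

# matches_ldh (hypothetical helper): succeeds if $1 fits the relaxed pattern.
# Assumes $1 is a single line.
matches_ldh() {
  printf '%s\n' "$1" | grep -Eq "$ldh_re"
}

for d in @xn--bcher-kva.ch @subdom..dom.ext @-dom.ext; do
  matches_ldh "$d" && echo "YES $d" || echo "NO $d"
done
```

Since the per-label limit is built into the pattern here, only the overall name length would still need a separate check.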

Tip of the hat to @Alfe for pointing out the problem with IDNs, and to @Arka for offering a simplified version of the regex to replace the lengthier one I had initially crafted under the mistaken assumption that single-letter domain names must be ruled out.

answered Oct 02 '22 at 11:10 by mklement0