how to validate internationalized domain names [closed]

Tags: regex, php, dns

I want to validate a domain URL in PHP that may be in internationalized domain name format, for example the Greek domain name http://παράδειγμα.δοκιμή. Is there any way to validate it using a regular expression?

asked Jan 14 '13 05:01 by user1969981


2 Answers

This is a so-called IDN (internationalized domain name). Clients that support IDNs normalize the name according to the IDNA2008 standard (RFC 5890), then encode the remaining non-ASCII characters with Punycode (RFC 3492) before submitting the name for DNS resolution.

By specification, a very large part of the Unicode repertoire is valid in an IDN, but each top-level domain authority can define its own set of permitted characters, so it is hard to create and maintain a complete regex.

If you want to accept IDN domains in your application, you should work with the encoded version internally. The PHP intl extension provides two functions to encode and decode IDN domain names, idn_to_ascii and idn_to_utf8:

echo idn_to_ascii('täst.de'); 

xn--tst-qla.de
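
For the decoding direction, here is a minimal sketch of a round trip. It assumes the intl extension is loaded. idn_round_trip is a hypothetical helper name, and passing the UTS #46 variant explicitly is an assumption about which processing rules you want.

```php
<?php
// Hypothetical helper (not from the original answer): converts a Unicode
// domain to its ASCII form and back again. Requires ext-intl.
function idn_round_trip(string $domain): array
{
    // idn_to_ascii() returns false when the name cannot be converted.
    $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    $unicode = $ascii === false
        ? false
        : idn_to_utf8($ascii, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);

    return ['ascii' => $ascii, 'unicode' => $unicode];
}
```

Calling idn_round_trip('täst.de') should give the ASCII form shown above under 'ascii' and the original spelling under 'unicode'. Storing the ASCII form and decoding only for display keeps comparisons simple.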

After encoding, the domain will pass any traditional regex check.

Simple validation:

$url = "http://example.com/";
if (preg_match('/^(http|https|ftp):\/\/([A-Z0-9][A-Z0-9_-]*(?:\.[A-Z0-9][A-Z0-9_-]*)+):?(\d+)?\/?/i', $url)) {
    echo 'OK';
} else {
    echo 'Invalid URL.';
}
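
The two steps above (encode first, then apply the traditional regex) might be combined roughly as follows. This is only a sketch: is_valid_idn_url is a hypothetical name, the host is extracted with a simple pattern that ignores user info and ports, and the conversion branch assumes the intl extension is available.

```php
<?php
// Hypothetical helper: encode the host part to ASCII, then reuse a
// traditional ASCII host-name pattern like the one in the answer.
function is_valid_idn_url(string $url): bool
{
    // Extract the host; the /u flag also rejects malformed UTF-8 input.
    if (preg_match('~^(?:https?|ftp)://([^/:?#]+)~iu', $url, $m) !== 1) {
        return false;
    }
    $host = $m[1];

    // Only hosts containing non-ASCII bytes need the IDN conversion.
    if (preg_match('/[^\x00-\x7F]/', $host) === 1) {
        // Requires ext-intl; returns false if the name cannot be encoded.
        $host = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
        if ($host === false) {
            return false;
        }
    }

    // ASCII check on the encoded host: at least two dot-separated labels.
    return preg_match('/^[A-Z0-9][A-Z0-9_-]*(?:\.[A-Z0-9][A-Z0-9_-]*)+$/i', $host) === 1;
}
```

This only checks syntax. It says nothing about whether the domain exists or whether the registry of that top-level domain permits the characters used.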

EDIT:

If you want real DNS verification, you can use dns_get_record (PHP 5) or gethostbyname.

e.g.

$domain = 'ελληνικά.idn.icann.org';
$idnDomain = idn_to_ascii( $domain );

if ( $dnsResult = dns_get_record( $idnDomain, DNS_ANY ) )
{
    echo $idnDomain , "\n";
    print_r( $dnsResult );
}
else
{
    echo "failed to lookup domain\n";
}

Result:

xn--hxargifdar.idn.icann.org
Array 
(
    [0] => Array
    (
        [host] => xn--hxargifdar.idn.icann.org
        [class] => IN
        [ttl] => 21456
        [type] => A
        [ip] => 199.7.85.10
    )
    [1] => Array
    (
        [host] => xn--hxargifdar.idn.icann.org
        [class] => IN
        [ttl] => 21600
        [type] => AAAA
        [ipv6] => 2620::2830:230:0:0:0:10
    )
)
answered Oct 28 '22 16:10 by Michel Feldheim


If you want to create your own library, you need to use the table of permitted codepoints (IANA — Repository of IDN Practices, IDN Character Validation Guidance, IDNA Parameters) and the table of Unicode Script properties (UNIDATA/Scripts.txt).

Gmail adopts the Unicode Consortium’s “Highly Restricted” specification (Protecting Gmail in a global world). The following combinations of Unicode Scripts are permitted.

  • Single script
  • Latin + Han + Hiragana + Katakana
  • Latin + Han + Bopomofo
  • Latin + Han + Hangul
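
The combinations listed above can be approximated with PCRE script properties. The sketch below is a simplification, not the full "Highly Restricted" rules: is_allowed_script_mix is a hypothetical name, the list of single scripts it probes is an arbitrary assumption (labels in scripts outside that list are rejected), and Common and Inherited characters are simply ignored.

```php
<?php
// Hypothetical helper: checks whether a label uses a single script or one
// of the three mixed-script combinations listed above.
function is_allowed_script_mix(string $label): bool
{
    // Drop characters that belong to no specific script (digits, hyphens,
    // combining marks and so on).
    $rest = preg_replace('/[\p{Common}\p{Inherited}]+/u', '', $label);
    if ($rest === null) {
        return false; // malformed UTF-8
    }
    if ($rest === '') {
        return true;  // nothing script-specific left to check
    }

    // Assumption: only these scripts are probed for the single-script case.
    $single = ['Latin', 'Greek', 'Cyrillic', 'Han', 'Hiragana', 'Katakana',
               'Bopomofo', 'Hangul', 'Arabic', 'Hebrew', 'Thai'];
    foreach ($single as $script) {
        if (preg_match('/^\p{' . $script . '}+$/u', $rest) === 1) {
            return true;
        }
    }

    $mixed = [
        ['Latin', 'Han', 'Hiragana', 'Katakana'],
        ['Latin', 'Han', 'Bopomofo'],
        ['Latin', 'Han', 'Hangul'],
    ];
    foreach ($mixed as $set) {
        $class = '';
        foreach ($set as $script) {
            $class .= '\p{' . $script . '}';
        }
        if (preg_match('/^[' . $class . ']+$/u', $rest) === 1) {
            return true;
        }
    }

    return false;
}
```

With this sketch, a label mixing Latin and Han passes, while a label mixing Cyrillic and Latin letters is rejected.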

You may need to pay attention to the special script property values (Common, Inherited, Unknown), since some characters have multiple properties or wrong properties.

For example, U+3099 (COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK) has two properties ("Katakana" and "Hiragana"), and the PCRE functions classify it as "Inherited". Another example is U+2A708. Although the correct script property of U+2A708 (a combination of U+30C8 KATAKANA LETTER TO and U+30E2 KATAKANA LETTER MO) is "Katakana", the Unicode specification misclassifies it as "Han".

You may also need to consider IDN homograph attacks. Google Chrome's IDN policy uses a blacklist of characters.

My recommendation is to use Zend\Validator\Hostname. This library uses the table of permitted code points for Japanese and Chinese.

If you use Symfony, consider upgrading the app to version 2.5, which adopts egulias/email-validator (Manual). You need extra validation to check whether the string is a well-formed byte sequence. See my report for the details.
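
One way to do the well-formedness check mentioned above is to let PCRE validate the input. This is a minimal sketch; is_well_formed_utf8 is a hypothetical name, and mb_check_encoding from the mbstring extension would be an alternative.

```php
<?php
// Hypothetical helper: an empty pattern with the /u flag matches any
// well-formed UTF-8 string, while preg_match() returns false (not 0)
// when the subject is malformed UTF-8.
function is_well_formed_utf8(string $value): bool
{
    return preg_match('//u', $value) === 1;
}

var_dump(is_well_formed_utf8('παράδειγμα.δοκιμή'));
var_dump(is_well_formed_utf8("\xE3\x81")); // truncated three-byte sequence
```

Running this check before any other validation avoids passing broken byte sequences to later steps.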

Don't forget XSS and SQL injection. The following address is a valid email address according to RFC 5322.

// From Japanese tutorial
// http://blog.tokumaru.org/2013/11/xsssqlrfc5322.html
"><script>alert('or/**/1=1#')</script>"@example.jp

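A syntactically valid address like the one above still has to be escaped at output time. The following is a minimal sketch for the HTML case only; escape_for_html is a hypothetical name, and for SQL the usual approach would be prepared statements (for example PDO::prepare) rather than escaping by hand.

```php
<?php
// Hypothetical helper: escapes a value for use in HTML text or in a
// quoted HTML attribute.
function escape_for_html(string $value): string
{
    // ENT_QUOTES also escapes single quotes; ENT_SUBSTITUTE replaces
    // invalid byte sequences instead of returning an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

$address = "\"><script>alert('or/**/1=1#')</script>\"@example.jp";
echo escape_for_html($address), PHP_EOL;
```

The escaped string contains no literal angle brackets or quotes, so it can no longer break out of the surrounding markup.
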
I doubt that idn_to_ascii is suitable for validation, since idn_to_ascii accepts almost all characters.

for ($i = 0; $i < 0x110000; ++$i) {
    $c = utf8_chr($i);

    if ($c !== '' && false !== idn_to_ascii($c)) {
        $number = strtoupper(dechex($i));
        $length = strlen($number);

        if ($i < 0x10000) {
            $number = str_repeat('0', 4 - $length).$number;
        }
    
        $idn = $c.'example.com';

        echo 'U+'.$number.' ';
        echo ' '.$idn.' '. idn_to_ascii($idn);
        echo PHP_EOL;
    }
}

function utf8_chr($code_point) {

    if ($code_point < 0 || 0x10FFFF < $code_point || (0xD800 <= $code_point && $code_point <= 0xDFFF)) {
        return '';
    }

    if ($code_point < 0x80) {
        $hex[0] = $code_point;
        $ret = chr($hex[0]);
    } else if ($code_point < 0x800) {
        $hex[0] = 0xC0 | $code_point >> 6;
        $hex[1] = 0x80  | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]);
    } else if ($code_point < 0x10000) {
        $hex[0] = 0xE0 | $code_point >> 12;
        $hex[1] = 0x80 | $code_point >> 6 & 0x3F;
        $hex[2] = 0x80 | $code_point & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]);
    } else  {
        $hex[0] = 0xF0 | $code_point >> 18;
        $hex[1] = 0x80 | $code_point >> 12 & 0x3F;
        $hex[2] = 0x80 | $code_point >> 6 & 0x3F;
        $hex[3] = 0x80 | $code_point  & 0x3F;
        $ret = chr($hex[0]).chr($hex[1]).chr($hex[2]).chr($hex[3]);
    }

    return $ret;
}

If you want to validate a domain by Unicode Script properties, use the PCRE functions.

The following code shows how to get the name of the Unicode Script property of a character. If you want to get the Unicode Script properties in JavaScript, use mathiasbynens/unicode-data.

function get_unicode_script_name($c) {

  // http://php.net/manual/regexp.reference.unicode.php
  $names = [
    'Arabic', 'Armenian', 'Avestan', 'Balinese', 'Bamum', 'Batak', 'Bengali', 
    'Bopomofo', 'Brahmi', 'Braille', 'Buginese', 'Buhid', 'Canadian_Aboriginal',
    'Carian', 'Chakma', 'Cham', 'Cherokee', 'Common', 'Coptic', 'Cuneiform',
    'Cypriot', 'Cyrillic', 'Deseret', 'Devanagari', 'Egyptian_Hieroglyphs',
    'Ethiopic', 'Georgian', 'Glagolitic', 'Gothic', 'Greek', 'Gujarati', 
    'Gurmukhi', 'Han', 'Hangul', 'Hanunoo', 'Hebrew', 'Hiragana', 'Imperial_Aramaic',
    'Inherited', 'Inscriptional_Pahlavi', 'Inscriptional_Parthian', 'Javanese',
    'Kaithi', 'Kannada', 'Katakana', 'Kayah_Li', 'Kharoshthi', 'Khmer', 'Lao', 'Latin',
    'Lepcha', 'Limbu', 'Linear_B', 'Lisu', 'Lycian', 'Lydian', 'Malayalam', 'Mandaic',
    'Meetei_Mayek', 'Meroitic_Cursive', 'Meroitic_Hieroglyphs', 'Miao', 'Mongolian',
    'Myanmar', 'New_Tai_Lue', 'Nko', 'Ogham', 'Old_Italic', 'Old_Persian',
    'Old_South_Arabian', 'Old_Turkic', 'Ol_Chiki', 'Oriya', 'Osmanya', 'Phags_Pa',
    'Phoenician', 'Rejang', 'Runic', 'Samaritan', 'Saurashtra', 'Sharada', 'Shavian',
    'Sinhala', 'Sora_Sompeng', 'Sundanese', 'Syloti_Nagri', 'Syriac', 'Tagalog',
    'Tagbanwa', 'Tai_Le', 'Tai_Tham', 'Tai_Viet', 'Takri', 'Tamil', 'Telugu', 'Thaana',
    'Thai', 'Tibetan', 'Tifinagh', 'Ugaritic', 'Vai', 'Yi'
  ];

  foreach ($names as $name) {

    $pattern = '/\p{'.$name.'}/u';

    if (preg_match($pattern, $c)) {
        return $name;
    }
  }

  return '';
}
answered Oct 28 '22 16:10 by masakielastic