
Which Unicode characters are allowed in IDN host labels?

Tags: unicode, tld, idn

I’m working on a “proper” URI validator, and at this point it all comes down to hostname validation; the rest isn’t that tricky.

I’m stuck on IDN hostname labels (i.e., labels containing Unicode; any Punycode-encoded strings have already been decoded at this point).

My first idea was basically two regexes: ^[a-zA-Z0-9\-]+$ for TLDs which don’t support IDNs and ^[a-zA-Z0-9\-\p{L}]+$ for those which do. The split could perhaps be based on Mozilla’s list of IDN-enabled TLDs. However, this is not ideal, since every IDN registry can decide for itself which characters to allow.
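As a rough illustration of the two-regex idea, here is a minimal Python sketch. Python's built-in re module does not support \p{L}, so the second check approximates it with unicodedata general categories. The function names are made up for this example.

```python
import re
import unicodedata

# ASCII letters, digits and hyphen, as in the first regex above.
# fullmatch() is used instead of ^...$ so a trailing newline is not accepted.
ASCII_LABEL = re.compile(r"[a-zA-Z0-9-]+")


def is_ascii_label(label: str) -> bool:
    """Check a label against the ASCII-only character class."""
    return ASCII_LABEL.fullmatch(label) is not None


def is_letter_label(label: str) -> bool:
    """Approximate the second regex: ASCII digits, hyphen, or any Unicode letter."""
    if not label:
        return False
    for ch in label:
        if ch in "-0123456789":
            continue
        # General categories Lu, Ll, Lt, Lm and Lo together make up \p{L}.
        if unicodedata.category(ch).startswith("L"):
            continue
        return False
    return True


if __name__ == "__main__":
    for candidate in ["example", "bücher", "пример", "foo_bar"]:
        print(candidate, is_ascii_label(candidate), is_letter_label(candidate))
```

A letters-only class also rejects some legitimate labels: Hindi words, for example, contain combining vowel signs (general category M), so is_letter_label fails on them. That is one more reason a per-code-point table is a better basis than \p{L}.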

What I’m looking for is a proper, consistent, up-to-date table of the Unicode characters allowed in the various TLDs. It’s beginning to look like I’ll have to collect all the data myself from Russian and Chinese registry sites (which is quite difficult).

So before I go trying to gather all this data myself, I wondered whether such a list already exists. Or are there better approaches, best/common practices, etc.? (I want the validation to be as strict as possible.)

asked May 17 '10 19:05 by Roland Franssen



2 Answers

IANA maintains a list of all code points and their IDNA status at https://www.iana.org/assignments/idna-tables-6.3.0/idna-tables-6.3.0.xhtml#idna-tables-properties

Code points marked PVALID are safe to use at the protocol level, although an individual registry may still allow only a subset of them. Those marked CONTEXTO or CONTEXTJ are allowed only when additional contextual rules are satisfied. Read RFC 5892 (the IDNA code point rules) and RFC 6452 (changing the status of a couple of characters) for all of the gory details.
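A minimal Python sketch of how such a table could drive label validation. The excerpt below is a small hand-copied slice of the Latin-1 range plus the two joiner characters, for illustration only; the "START-END" row format and all function names are assumptions of this example, and a real validator would load the full, current table from IANA and implement the contextual rules from RFC 5892.

```python
# Illustrative excerpt only; load the full IANA table in real code.
EXCERPT = [
    ("0000-002C", "DISALLOWED"),
    ("002D", "PVALID"),          # HYPHEN-MINUS
    ("002E-002F", "DISALLOWED"),
    ("0030-0039", "PVALID"),     # digits 0-9
    ("003A-0060", "DISALLOWED"), # includes uppercase A-Z
    ("0061-007A", "PVALID"),     # lowercase a-z
    ("007B-00B6", "DISALLOWED"),
    ("00B7", "CONTEXTO"),        # MIDDLE DOT
    ("00B8-00DE", "DISALLOWED"),
    ("00DF-00F6", "PVALID"),
    ("00F7", "DISALLOWED"),      # DIVISION SIGN
    ("00F8-00FF", "PVALID"),
    ("200C-200D", "CONTEXTJ"),   # ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER
]


def parse_rows(rows):
    """Turn ("START-END", status) rows into (start, end, status) tuples."""
    ranges = []
    for span, status in rows:
        start, _, end = span.partition("-")
        ranges.append((int(start, 16), int(end or start, 16), status))
    return ranges


def status_of(ch, ranges):
    """Return the status of one character, or None if the table lacks it."""
    cp = ord(ch)
    for start, end, status in ranges:
        if start <= cp <= end:
            return status
    return None


def classify_label(label, ranges):
    """Return "ok", "needs-context-check" or "rejected" for a decoded label."""
    if not label:
        return "rejected"
    result = "ok"
    for ch in label:
        status = status_of(ch, ranges)
        if status == "PVALID":
            continue
        if status in ("CONTEXTJ", "CONTEXTO"):
            # Allowed only if the contextual rule holds; not implemented here.
            result = "needs-context-check"
            continue
        return "rejected"
    return result


RANGES = parse_rows(EXCERPT)

if __name__ == "__main__":
    for candidate in ["bücher", "Example", "l\u00b7l"]:
        print(repr(candidate), classify_label(candidate, RANGES))
```

The sketch treats anything outside the excerpt as rejected, and it does not cover whole-label rules such as hyphen placement or bidirectional text, so it only shows the per-code-point part of the check.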

answered Oct 05 '22 16:10 by Joe Hildebrand


Can't you convert all Unicode domains to Punycode and validate that? Since DNS hostnames don't carry raw UTF-8 characters anyway, this might be the best solution.
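A minimal Python sketch of this approach, assuming the standard library's "idna" codec is acceptable. That codec implements the older IDNA 2003 rules (RFC 3490), so it will not match the IDNA 2008 tables exactly; the function name and the label regex are choices made for this example.

```python
import re

# Letters, digits, hyphen; 1 to 63 characters; no leading or trailing hyphen.
LDH_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)", re.IGNORECASE)


def to_ascii_hostname(hostname: str):
    """Return the ASCII (Punycode) form of a hostname, or None if invalid."""
    try:
        # The stdlib "idna" codec applies nameprep and Punycode per label.
        ascii_name = hostname.encode("idna").decode("ascii")
    except UnicodeError:
        return None
    # Validate the encoded form with the ordinary ASCII hostname rules.
    for label in ascii_name.split("."):
        if LDH_LABEL.fullmatch(label) is None:
            return None
    return ascii_name


if __name__ == "__main__":
    for candidate in ["bücher.example", "foo_bar.example", "-bad.example"]:
        print(candidate, "->", to_ascii_hostname(candidate))
```

This only shows that the name can be encoded and that the encoded form is a syntactically valid hostname. It says nothing about whether a particular TLD registry accepts the characters, and it does not check the overall hostname length or accept a trailing dot. The third-party idna package on PyPI is one option if IDNA 2008 behaviour is needed.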

answered Oct 05 '22 15:10 by Byron Whitlock