Is it possible for a valid URL to contain non-escaped Unicode characters?
Q: What is an Internationalized Domain Name (IDN)? Domain names, such as "macchiati.blogspot.com", were originally designed to support only ASCII characters. In 2003, the first specification was released that allowed most Unicode characters to be used in domain names.
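As a rough illustration of how an internationalized domain name is carried in ASCII form, the sketch below uses Python's built-in "idna" codec, which follows the 2003 IDNA rules. The host name "bücher.example" is a made-up example, not one taken from the text above, and newer IDNA revisions may treat some characters differently.

```python
# Convert a Unicode host name to its ASCII-compatible ("xn--") form and back.
# "bücher.example" is a hypothetical host chosen only for illustration.
unicode_host = "bücher.example"

# The built-in "idna" codec applies the 2003 IDNA rules label by label.
ascii_host = unicode_host.encode("idna").decode("ascii")
print(ascii_host)   # xn--bcher-kva.example

# Decoding the ASCII form recovers the original Unicode host name.
round_trip = ascii_host.encode("ascii").decode("idna")
print(round_trip)   # bücher.example
```

Only the label containing non-ASCII characters gets the "xn--" prefix; the plain ASCII label "example" passes through unchanged.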
Building a valid URL

By the same token, any code that generates or accepts UTF-8 input might treat URLs containing non-ASCII characters as "valid", but it would still need to translate those characters before sending them to a web server. This process is called URL-encoding or percent-encoding.
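URLs can only be sent over the Internet using the ASCII character set. Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format. URL encoding replaces each unsafe character with a "%" followed by two hexadecimal digits; a non-ASCII character is first converted to bytes (normally UTF-8), and each byte is written as its own "%XX" triplet. URLs cannot contain spaces, so a space is usually encoded as "%20".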
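The percent-encoding step described here can be sketched with Python's standard urllib.parse module. The path "/wiki/Zürich café" is an invented example; quote() encodes the text as UTF-8 by default and leaves "/" unescaped unless told otherwise.

```python
from urllib.parse import quote, unquote

# Hypothetical path containing two non-ASCII letters and a space.
raw_path = "/wiki/Zürich café"

# quote() encodes to UTF-8, then writes each unsafe byte as %XX.
encoded_path = quote(raw_path)
print(encoded_path)   # /wiki/Z%C3%BCrich%20caf%C3%A9

# unquote() reverses the process and restores the original text.
decoded_path = unquote(encoded_path)
print(decoded_path)   # /wiki/Zürich café
```

Each accented letter becomes two "%XX" triplets because it occupies two bytes in UTF-8, while the space becomes the single triplet "%20".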
The Unicode Standard, which is maintained by the Unicode Consortium, defines (as of version 15.0) 149,186 characters covering 161 modern and historic scripts, as well as symbols, emoji (including in colors), and non-visual control and formatting codes.
Yes, but only for the subset of ASCII (and therefore of Unicode) that is allowed unescaped in URIs, such as letters and digits. The majority of the Unicode character set has to be percent-encoded.
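To make the "allowed unescaped" subset concrete, here is a minimal sketch that checks a string against the unreserved set from RFC 3986 (ASCII letters, digits, "-", ".", "_" and "~"). The helper name chars_needing_escape and the sample strings are hypothetical, and the check deliberately ignores reserved delimiters such as "/" and "?", which may also appear unescaped when they serve their structural role.

```python
import string

# Unreserved characters per RFC 3986: these never need percent-encoding.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def chars_needing_escape(text):
    """Return the characters in `text` that fall outside the unreserved set."""
    return [ch for ch in text if ch not in UNRESERVED]

# A plain ASCII name made only of unreserved characters passes as-is.
print(chars_needing_escape("report-2024_v1.0~draft"))   # []

# A non-ASCII letter and a space are both flagged for escaping.
print(chars_needing_escape("naïve path"))               # ['ï', ' ']
```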