I'm designing a schema for a message on a microblogging platform, which will need to have a defined language. These messages will be distributed across networks between many nodes, so I need to make the schema compact but still completely multilingual.
I'm going to use the IETF language codes (en, en-AU, etc.), but I need to know whether there is a specific way to represent them efficiently. There have been several standards for language tags, and the current specification, RFC 5646, is complicated by its backwards compatibility with the earlier ones. I don't fully understand the space requirements, since a tag can contain multiple subtags.
What is the most space-efficient way to represent an IETF language code?
An IETF BCP 47 language tag is a standardized code used to identify human languages on the Internet. The tag structure is standardized by the Internet Engineering Task Force (IETF) in Best Current Practice (BCP) 47; the subtags are maintained in the IANA Language Subtag Registry.
A language tag is composed of a sequence of one or more subtags, such as language, script, region and variant subtags. When a language tag consists of more than one subtag, the subtags are separated by the "-" character.
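The subtag structure described above can be sketched in Python. This is a simplified illustration, not a full RFC 5646 parser: `split_tag` is a hypothetical helper, and it assumes tags of the form language[-script][-region][-variant...] only, ignoring extlang, extension, private-use and grandfathered tags.

```python
import re

# Simplified subtag shapes from RFC 5646. Assumption: extlang, extensions,
# private-use and grandfathered tags are out of scope for this sketch.
_LANGUAGE = re.compile(r"[A-Za-z]{2,8}")
_SCRIPT = re.compile(r"[A-Za-z]{4}")
_REGION = re.compile(r"[A-Za-z]{2}|[0-9]{3}")
_VARIANT = re.compile(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}")


def split_tag(tag):
    """Split a tag such as 'zh-Hant-TW' into named subtags (hypothetical helper)."""
    subtags = tag.split("-")
    if not _LANGUAGE.fullmatch(subtags[0]):
        raise ValueError(f"invalid primary language subtag: {subtags[0]!r}")
    result = {
        "language": subtags[0].lower(),
        "script": None,
        "region": None,
        "variants": [],
    }
    rest = subtags[1:]
    # Optional subtags must appear in this order: script, region, variants.
    if rest and _SCRIPT.fullmatch(rest[0]):
        result["script"] = rest.pop(0).title()
    if rest and _REGION.fullmatch(rest[0]):
        result["region"] = rest.pop(0).upper()
    while rest and _VARIANT.fullmatch(rest[0]):
        result["variants"].append(rest.pop(0).lower())
    if rest:
        raise ValueError(f"unsupported subtag(s): {rest!r}")
    return result
```

For example, `split_tag("en-AU")` yields a language of `en` and a region of `AU`, while `split_tag("sr-Latn")` yields a language of `sr` and a script of `Latn`.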
RFC 3066 essentially allowed you to compose language tags that were either a language code on its own, a language code plus a country code, or one of a small number of specially registered values in the IANA language tag registry. RFC 5646 caters for more types of subtag, and allows you to combine them in various ways.
Written as POSIX-style locale identifiers, Simplified Chinese is conventionally zh_CN and Traditional Chinese is zh_TW. BCP 47 tags use a hyphen instead of an underscore (zh-CN, zh-TW).
I think the IETF specs for handling locale codes are indeed the industry best practice (BCP stands for "Best Current Practice"), though not without compromises made to maintain backwards compatibility. I still recommend adapting them to your needs, since the most important internationalization libraries and standards (Unicode, ICU) use them.
BCP 47 (RFC 5646), section 4.4.1, says that any protocol with a limited buffer size for language tags must allow at least 35 characters, derived as follows:
    language      =  8 ; longest allowed registered value
                       ;   longer than primary+extlang
                       ;   which requires 7 characters
    script        =  5 ; if not suppressed: see Section 4.1
    region        =  4 ; UN M.49 numeric region code
                       ;   ISO 3166-1 codes require 3
    variant1      =  9 ; needs 'language' as a prefix
    variant2      =  9 ; very rare, as it needs
                       ;   'language-variant1' as a prefix

    total         = 35 characters

Figure 7: Derivation of the Limit on Tag Length
If you only care about language and script (and not region, which mostly governs locale-sensitive data such as date and time formats), you can make do with at most 13 characters: 8 for the language plus 5 for the hyphen and the script.
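The arithmetic behind these limits can be checked with a few lines of Python. The widths are taken from Figure 7 of RFC 5646; the names `MAX_WIDTH` and `max_tag_length` are made up for this illustration.

```python
# Maximum subtag widths from Figure 7 of RFC 5646. Every subtag after the
# first already includes its leading hyphen in the count.
MAX_WIDTH = {
    "language": 8,
    "script": 5,
    "region": 4,
    "variant1": 9,
    "variant2": 9,
}


def max_tag_length(*parts):
    """Sum the worst-case widths of the chosen subtags (hypothetical helper)."""
    return sum(MAX_WIDTH[part] for part in parts)


full = max_tag_length("language", "script", "region", "variant1", "variant2")
lang_script = max_tag_length("language", "script")
lang_script_region = max_tag_length("language", "script", "region")

print(full, lang_script, lang_script_region)  # prints: 35 13 17
```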
In practice, most tags will be just a two-character language code. The only common examples I deal with regularly that require script subtags are sr-Latn and sr-Cyrl (Serbian written in Latin and Cyrillic, respectively), zh-Hant (Traditional Chinese), and zh-Hans (Simplified Chinese). You will also most probably not need variants, which means that most real-world tags should fit within 17 characters (8 for the language, 5 for the script and 4 for the region).
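Since most tags are far shorter than the worst case, one option for a compact schema is a variable-length field with a one-byte length prefix rather than a fixed-width buffer. The sketch below assumes a binary message format and caps tags at 35 ASCII bytes; `encode_tag` and `decode_tag` are hypothetical names, and this layout is only one possibility, not something RFC 5646 prescribes.

```python
MAX_TAG_BYTES = 35  # minimum buffer size that RFC 5646 section 4.4.1 asks for


def encode_tag(tag):
    """Encode a language tag as one length byte followed by ASCII bytes."""
    raw = tag.encode("ascii")  # language tags are ASCII-only
    if not 1 <= len(raw) <= MAX_TAG_BYTES:
        raise ValueError(f"tag must be 1..{MAX_TAG_BYTES} bytes, got {len(raw)}")
    return bytes([len(raw)]) + raw


def decode_tag(buf, offset=0):
    """Decode a tag written by encode_tag; return (tag, next_offset)."""
    length = buf[offset]
    end = offset + 1 + length
    if length == 0 or length > MAX_TAG_BYTES or end > len(buf):
        raise ValueError("malformed language tag field")
    return buf[offset + 1:end].decode("ascii"), end
```

With this layout, `en` costs 3 bytes on the wire and `zh-Hant` costs 8, compared with 35 or 17 for a fixed-width field.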