What data type should I use for IETF language codes?

I'm designing a schema for a message on a microblogging platform, which will need to have a defined language. These messages will be distributed across networks between many nodes, so I need to make the schema compact but still completely multilingual.

I'm going to use IETF language codes (en, en-AU, etc.), but I need to know whether there is a specific way to represent them efficiently. There are multiple standards for language tags, and the current specification, RFC 5646, is complicated by its backwards compatibility with the previous ones. I don't fully understand the space requirements, as a tag can contain multiple subtags.

What is the most space-efficient way to represent an IETF language code?

asked Jul 25 '13 02:07

liamzebedee

1 Answer

I think the IETF specs for handling locale codes are indeed the industry "Best Common Practice", though not without compromises made to maintain backwards compatibility. I still recommend adapting them to your needs, since the most important internationalization libraries and standards (Unicode, ICU) use them.

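BCP 47 / RFC 5646, section 4.4.1, recommends allowing for a tag length of at least 35 characters, derived as follows (every size after the first includes the hyphen separator):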

   language      =  8 ; longest allowed registered value
                      ;   longer than primary+extlang
                      ;   which requires 7 characters
   script        =  5 ; if not suppressed: see Section 4.1
   region        =  4 ; UN M.49 numeric region code
                      ;   ISO 3166-1 codes require 3
   variant1      =  9 ; needs 'language' as a prefix
   variant2      =  9 ; very rare, as it needs
                      ;   'language-variant1' as a prefix

   total         = 35 characters

              Figure 7: Derivation of the Limit on Tag Length

If you only care about language and script (and not the region, which drives some locale-sensitive data such as date and time formats), you can make do with a maximum of 13 characters.
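The 35- and 13-character figures follow directly from the per-subtag sizes in Figure 7. A minimal sketch of that arithmetic, where the dictionary and helper names are made up for illustration and are not part of any library:

```python
# Per-subtag budgets copied from RFC 5646, Figure 7. Every entry after
# "language" includes one character for the "-" separator.
SUBTAG_BUDGET = {
    "language": 8,
    "script": 5,
    "region": 4,
    "variant1": 9,
    "variant2": 9,
}


def tag_budget(*parts):
    """Sum the character budget for the chosen subtag kinds (hypothetical helper)."""
    return sum(SUBTAG_BUDGET[part] for part in parts)


FULL = tag_budget("language", "script", "region", "variant1", "variant2")
LANG_SCRIPT = tag_budget("language", "script")

print(FULL)         # 35
print(LANG_SCRIPT)  # 13
```

Because language tags are restricted to ASCII letters, digits, and hyphens, each of these characters costs one byte in ASCII or UTF-8.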

In practice, most tags will be just a two-character language code. The only common examples I deal with regularly that require script subtags are sr-Latn and sr-Cyrl (Serbian written in Latin and Cyrillic script, respectively), zh-Hant (Traditional Chinese), and zh-Hans (Simplified Chinese). You will most probably not need variants either, which means that most real-world locale codes should fit within a 17-character limit (language, script, and region).
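As a rough illustration of enforcing the 17-character limit in a schema, the sketch below accepts only tags of the shape language[-script][-region] and returns them as ASCII bytes. The pattern is a deliberate simplification (an assumption of this example, not the full RFC 5646 grammar): it ignores extlang, variants, extensions, and private-use subtags. The names `SIMPLE_TAG`, `MAX_TAG_LENGTH`, and `encode_tag` are hypothetical.

```python
import re

MAX_TAG_LENGTH = 17  # 8 (language) + 5 (script) + 4 (region), hyphens included

# Simplified shape: language, optional 4-letter script, optional region
# (2 letters or 3 digits). Assumption: no variants or extensions are needed.
SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,8})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def encode_tag(tag):
    """Return the tag as ASCII bytes, or raise ValueError if it is not supported."""
    if len(tag) > MAX_TAG_LENGTH or SIMPLE_TAG.fullmatch(tag) is None:
        raise ValueError(f"unsupported language tag: {tag!r}")
    return tag.encode("ascii")


for tag in ["en", "en-AU", "sr-Latn", "sr-Cyrl", "zh-Hant", "zh-Hans"]:
    # Print each tag with the number of bytes it occupies.
    print(tag, len(encode_tag(tag)))
```

A tag with a variant, such as en-GB-oxendict, is rejected by this sketch; if variants matter for your data, the full 35-character allowance is the safer choice.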

answered Sep 20 '22 09:09

Shervin

Tags: language-agnostic, types, database-design, internationalization, multilingual
