What are the limitations of primitive character types in D?

I am currently exploring the specification of the Digital Mars D language, and am having a little trouble understanding the complete nature of the primitive character types. The book Learn to Tango With D is similarly vague on the capabilities and limitations of the language in this area.

The types are given on the website as:

char;    // unsigned 8 bit UTF-8
wchar;   // unsigned 16 bit UTF-16
dchar;   // unsigned 32 bit UTF-32

Since we know that most of the Unicode Transformation Format (UTF) encodings represent characters with a variable bit-width, does this mean that a char in D can only contain the values that will fit in 8 bits, or does it expand in the machine's physical memory when you give it double-byte characters? Perhaps there is some other possibility, like automatic casting into the next most appropriate type as you overflow the variable?

Let's say, for example, I want to use the UTF-8 char in an editor and type in Chinese. Will it simply fall over, or is it able to deal with Unicode characters more 'correctly', like in C#? Would it still be necessary to provide glue code to allow working with any language supported by Unicode?

I'd appreciate any specific information you can offer on how these types work under the covers, and any general best practices advice on dealing with their limitations.

asked Jul 12 '09 17:07

Ian Gilham

1 Answer

A single char or wchar represents a UTF code unit, not necessarily a whole character. On its own, a char can hold either an ASCII character (0-127) or one byte of a multi-byte UTF-8 sequence that encodes a Unicode character (code point); it never widens to fit a larger value. Likewise, a wchar holds either a code point below 65536 or one half of a surrogate pair. Only the dchar type can represent every Unicode character in a single value, because there are more than 65536 code points in Unicode.
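The difference between code units and code points shows up in array lengths. The following is a minimal sketch, assuming a current D2 compiler with Phobos and a source file saved as UTF-8; the sample strings are only illustrations.

```d
import std.stdio : writeln;

void main()
{
    string s = "日本語";           // UTF-8: each of these characters takes 3 code units
    assert(s.length == 9);         // .length counts char code units, not characters

    wstring w = "日本語"w;         // UTF-16: one code unit each (all three are below 65536)
    assert(w.length == 3);

    dstring d = "日本語"d;         // UTF-32: always one code unit per code point
    assert(d.length == 3);

    // foreach with a dchar loop variable decodes whole code points from UTF-8
    size_t count = 0;
    foreach (dchar c; s)
        ++count;
    assert(count == 3);

    // A code point above 65535 needs two wchar code units (a surrogate pair)
    wstring clef = "\U0001D11E"w;  // MUSICAL SYMBOL G CLEF
    assert(clef.length == 2);

    writeln("all checks passed");
}
```

So a string can hold Chinese text without trouble; what does not work is treating s[i] as a character, since indexing and slicing operate on code units.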

Casting one string type to another (string, wstring and dstring are simply dynamic arrays of the respective character types) will not automatically convert the contents to the target UTF representation. To transcode, you must use the functions toUTF8, toUTF16 and toUTF32 from std.utf (or toString / toString16 / toString32 from tango.text.convert.Utf if you use Tango).
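A short sketch of the Phobos route, again assuming a current D2 compiler; it contrasts transcoding with a plain cast, which only reinterprets the same bytes. The sample text is an assumption chosen so that its UTF-8 length is even, which the array cast requires.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

void main()
{
    string  s = "h\u00E9llo";      // "héllo": 6 UTF-8 code units (é takes two)
    wstring w = toUTF16(s);        // transcoded to 5 UTF-16 code units
    dstring d = toUTF32(s);        // transcoded to 5 UTF-32 code units

    assert(s.length == 6);
    assert(w.length == 5);
    assert(d.length == 5);
    assert(toUTF8(d) == s);        // round trip back to UTF-8 is lossless

    // A cast repaints the 6 bytes as 3 wchar values; no transcoding happens,
    // so the result is not the UTF-16 encoding of the text.
    wstring repainted = cast(wstring) s;
    assert(repainted.length == 3);
    assert(repainted != w);
}
```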

Users have implemented string classes which will automatically use the most memory-efficient representation that can map each character to a single code unit. This allows quick slicing and indexing with a minimal memory overhead. One such implementation is mtext by Christopher E. Miller.
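The idea behind such classes can be sketched roughly as follows. This is not mtext's actual code or API; Width and narrowestWidth are hypothetical names used only to illustrate how the narrowest one-unit-per-character encoding might be chosen.

```d
// Hypothetical helper, not part of mtext or Phobos.
enum Width { utf8 = 1, utf16 = 2, utf32 = 4 }

// Returns the narrowest encoding in which every code point of `text`
// fits in exactly one code unit.
Width narrowestWidth(const(dchar)[] text)
{
    Width w = Width.utf8;
    foreach (dchar c; text)
    {
        if (c > 0xFFFF)
            return Width.utf32;    // would need a surrogate pair in UTF-16
        if (c > 0x7F)
            w = Width.utf16;       // multi-byte in UTF-8, one unit in UTF-16
    }
    return w;
}

void main()
{
    assert(narrowestWidth("hello"d) == Width.utf8);
    assert(narrowestWidth("日本語"d) == Width.utf16);
    assert(narrowestWidth("\U0001D11E"d) == Width.utf32);
}
```

With every character stored in one code unit, indexing and slicing by character position become constant-time array operations.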

Vladimir Panteleev


Tags: unicode, utf-8, primitive-types, utf, d
