Unicode endianness puzzled me

Tags: python, encoding, utf-8, endianness, ucs2

I used gedit to create three files, named ok1, ok2, and ok3. Each contains the same character "你" ("you" in English), saved in a different encoding: GBK, UTF-8, and UCS-2 respectively.

>>> f1 = open('ok1', 'rb').read()
>>> f2 = open('ok2', 'rb').read()
>>> f3 = open('ok3', 'rb').read()
>>> f1
'\xc4\xe3\n'
>>> f2
'\xe4\xbd\xa0\n'
>>> f3
'`O\n\x00'
>>> hex(ord("`"))
'0x60'
>>> hex(ord("O")) 
'0x4f'

So f3 actually starts with the bytes '\x60\x4f', but the following output confused me:

>>> '\xe4\xbd\xa0'.decode("utf-8")
u'\u4f60'
>>> '\xc4\xe3'.decode("gbk")
u'\u4f60'
>>>

Why is there an endianness problem only in UCS-2 (or, say, Unicode), and not in UTF-8 or GBK?

asked Sep 08 '12 06:09

Dd Pp

2 Answers

UTF-8 and GBK store data as a sequence of single bytes. Each encoding defines exactly which byte value follows which, so this byte order does not change with the architecture used for encoding, transmission, or decoding.

On the other hand, UCS-2 and its successor UTF-16 store data as a sequence of 2-byte units. The order of the individual bytes within each 2-byte unit is the endianness, and it depends on the underlying machine architecture. Systems must agree on how to identify the endianness of these units before exchanging data encoded in UCS-2.
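To make the contrast concrete, here is a minimal sketch that encodes the character from the question four ways. It assumes Python 3 (the session in the question is Python 2), and the variable names are only illustrative.

```python
# Python 3 sketch: one character, four encodings.
ch = "\u4f60"  # the character from the question

utf8 = ch.encode("utf-8")
gbk = ch.encode("gbk")
utf16_le = ch.encode("utf-16-le")  # low byte of each 16-bit unit first
utf16_be = ch.encode("utf-16-be")  # high byte of each 16-bit unit first

for name, data in [("utf-8", utf8), ("gbk", gbk),
                   ("utf-16-le", utf16_le), ("utf-16-be", utf16_be)]:
    # Show every byte as two hex digits.
    print(name, " ".join("{:02x}".format(b) for b in data))
```

Only the two UTF-16 variants differ from each other, and only in the order of the two bytes; the UTF-8 and GBK byte sequences come out the same on any machine.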

In your case, the Unicode code point U+4F60 is encoded in UCS-2 as a single 2-byte unit, 0x4F60. Since your machine stores the least significant byte before the most significant one in memory, the byte sequence (0x60, 0x4F) was written to the file. Reading the file therefore yields the bytes in that order.

Python can still decode this data correctly. When the data has no byte order mark, the 'utf-16' codec assumes the machine's native byte order (little-endian here) and assembles each 2-byte unit accordingly:

>>> '`O\n\x00'.decode('utf-16')
u'\u4f60\n'

answered Sep 17 '22 22:09

Tugrul Ates

Endianness only applies to multi-byte words, but UTF-8 encodes information in units of 8 bits (that's what the 8 in the name stands for). With one-byte units, the question of byte ordering within a unit never arises.

Sometimes UTF-8 needs more than one of those units to encode a character, but the units are still considered distinct. The letter A is one byte, 0x41, for example. When it has to encode a character that needs more bytes, it uses a leading indicator byte followed by continuation bytes that carry the rest of the information for that character. Logically, these remain separate units, always written in the same order.
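The leading-byte and continuation-byte layout can be checked by hand on the three bytes from the question. This is a small sketch assuming Python 3; the names lead, cont1, and cont2 are made up for illustration.

```python
# Python 3 sketch: take the UTF-8 bytes of U+4F60 apart by hand.
data = "\u4f60".encode("utf-8")  # b'\xe4\xbd\xa0'
lead, cont1, cont2 = data        # iterating over bytes yields integers

# A 3-byte UTF-8 sequence has the bit layout 1110xxxx 10xxxxxx 10xxxxxx.
is_three_byte_lead = (lead >> 4) == 0b1110
are_continuations = (cont1 >> 6) == 0b10 and (cont2 >> 6) == 0b10

# Concatenate the payload bits (the x positions) in the order the bytes appear.
code_point = ((lead & 0x0F) << 12) | ((cont1 & 0x3F) << 6) | (cont2 & 0x3F)
print(hex(code_point))  # 0x4f60
```

The position of each byte in the sequence is fixed by the encoding, so there is nothing left for the machine's byte order to decide.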

GBK uses a similar scheme: characters are built from 1-byte units, and just like in UTF-8, some characters use a second byte.

UCS-2 (and its successor, UTF-16), on the other hand, is a 2-byte format. It encodes information in units of 16 bits, and those 16 bits always go together. The 2 bytes in that unit belong together logically, and modern architectures treat them as one unit and have therefore made a decision about the order in which they are stored. That's where endianness comes in: the order of the 2 bytes in a unit is architecture dependent. On your architecture, the bytes are ordered little-endian, meaning that the 'smaller' (least significant) byte goes first. This is why the 0x60 byte comes before the 0x4F byte in your file.
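The same effect shows up whenever a 16-bit integer is written to bytes, independent of any text codec. Here is a minimal sketch using the standard struct module, assuming Python 3; the variable names are illustrative only.

```python
# Python 3 sketch: one 16-bit unit, stored in both byte orders.
import struct
import sys

unit = 0x4F60  # the single 16-bit unit for U+4F60

little = struct.pack("<H", unit)  # least significant byte first: b'`O'
big = struct.pack(">H", unit)     # most significant byte first: b'O`'
native = struct.pack("=H", unit)  # whatever order this machine uses

print(sys.byteorder, little, big, native)
```

On a little-endian machine, native matches little, which is the byte order found in the ok3 file.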

Note that Python can read either big-endian or little-endian UTF-16 just fine; you can pick the endianness explicitly if there is no indicator character at the start (the Byte Order Mark, or BOM):

>>> '`O\n\x00'.decode('utf-16')
u'\u4f60\n'
>>> '`O\n\x00'.decode('utf-16-le')
u'\u4f60\n'
>>> 'O`\x00\n'.decode('utf-16-be')
u'\u4f60\n'

In the last example, the two bytes within each 2-byte unit have been swapped and the data is decoded as big-endian.
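For completeness, here is a short sketch of how the BOM removes the guesswork. It assumes Python 3 and uses the BOM constants from the standard codecs module; the variable names are illustrative.

```python
# Python 3 sketch: a BOM tells the 'utf-16' codec which byte order to use.
import codecs

ch = "\u4f60"

# Encoding with plain 'utf-16' prepends a BOM in the machine's native order.
with_bom = ch.encode("utf-16")

# Build both byte orders by hand, each with its matching BOM.
explicit_le = codecs.BOM_UTF16_LE + ch.encode("utf-16-le")  # ff fe 60 4f
explicit_be = codecs.BOM_UTF16_BE + ch.encode("utf-16-be")  # fe ff 4f 60

# The generic codec reads the BOM, picks the byte order, and drops the BOM.
from_le = explicit_le.decode("utf-16")
from_be = explicit_be.decode("utf-16")

# The explicit codecs do not interpret a BOM; it stays in the text as U+FEFF.
kept_bom = explicit_le.decode("utf-16-le")
```

Both from_le and from_be come back as the original character, while kept_bom still carries the U+FEFF character at the front, so the explicit -le and -be codecs are best kept for data that has no BOM.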

answered Sep 19 '22 22:09

Martijn Pieters
