
Python - file.write() causes Chinese text

Tags:

python-3.5

When I write a certain string to a file in an infinite loop, for example:

file = open('txt.txt', 'w')
while 1:
    file.write('colour')

It gives me all this Chinese text: [picture]

Why does this happen?

Marcus W. asked Nov 09 '22 09:11



1 Answer

You can get the same result by copy-pasting colour several times in Notepad, then saving and reloading the file. There's nothing wrong with your Python code. The bytes written to the file look like this (in hex):

63 6F 6C 6F 75 72  63 6F 6C 6F 75 72 ...
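To check the byte values yourself, a minimal sketch along these lines should do. It assumes the text is ASCII-only, in which case UTF-8 and the usual Windows default code pages produce identical bytes; the variable names are only illustrative.

```python
# Encode the text; for ASCII-only text this matches what ends up in the file.
data = "colour".encode("utf-8")

# Format each byte as two uppercase hex digits, separated by spaces.
hex_dump = " ".join(format(b, "02X") for b in data)
print(hex_dump)  # 63 6F 6C 6F 75 72
```

Each letter takes exactly one byte, so the six-byte pattern repeats for as long as the loop keeps writing.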

When Notepad reads these bytes it has to guess which encoding they are in. Ideally it would decode the text as UTF-8 or ASCII. Instead it sees a pattern in the bytes and guesses wrong.

I noticed that every pair of bytes corresponds to one Chinese character. This suggests Notepad is decoding the file as UTF-16. The following test in Python confirms that this is the case:

>>> original = 'colour' * 100
>>> original.encode('utf-8').decode('utf-16')
\u6f63\u6f6c\u7275\... # repeating

These code points correspond to 潣, 潬, and 牵, which is exactly what Notepad displays. So the issue is that Notepad is incorrectly decoding your bytes as UTF-16 instead of UTF-8. This is reminiscent of the old "Bush hid the facts" bug.
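If the file has to open cleanly in Notepad, one possible workaround (a sketch, not the only fix) is to start the file with a UTF-8 byte order mark so the editor has an explicit signal instead of a guess. Python's built-in utf-8-sig codec writes that mark once at the start of the file. The helper name write_repeated, the bounded count, and the temp-file path below are hypothetical choices for illustration; the bounded loop stands in for the infinite one in the question.

```python
import codecs
import os
import tempfile


def write_repeated(path, text, count, encoding="utf-8-sig"):
    """Write `text` to `path` `count` times; utf-8-sig prepends a BOM once."""
    with open(path, "w", encoding=encoding) as f:
        for _ in range(count):
            f.write(text)


# Hypothetical output location, used here only so the sketch is runnable.
path = os.path.join(tempfile.gettempdir(), "txt_with_bom.txt")
write_repeated(path, "colour", 100)

# Read the raw bytes back to confirm the file starts with the UTF-8 BOM.
with open(path, "rb") as f:
    raw = f.read()
print(raw.startswith(codecs.BOM_UTF8))
```

Reading the file back in Python with encoding="utf-8-sig" strips the mark again, so the text itself is unchanged.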

Trevor Merrifield answered Dec 06 '22 13:12
