This character - ㎜ - raises a UnicodeEncodeError

Question

I am using a Python script to convert files from gb2312 to utf-8. This character messes everything up: ㎜ (it is one symbol, not "mm").

text = '㎜'
text.encode(encoding='gb2312')

raises this error:

UnicodeEncodeError: 'gb2312' codec can't encode character '\u339c' in position 0: illegal multibyte sequence

I can work around it with text.replace('㎜', 'mm'). But what if there are other such characters? What is wrong with it? Why is it so special?

Is there a way to make Python treat it as any other character?

zwol · Accepted Answer

OK, so, I downloaded the file 1.php and ran your original script on it, and I get a different error message:

UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 99-100:
  illegal multibyte sequence

The bytes in the file at offsets 99 and 100 are A9 4C in that order. That is neither a valid GB2312 nor a valid UTF-8 encoding of anything. I suspect you may be in the situation of having a whole bunch of files that are supposedly GB2312 but are actually in some other encoding. If you need to just bull through all such problems, you can use errors='replace' and mode='rU' (the latter makes Python understand your DOS newlines; the 'U' flag was removed in Python 3.11, where plain mode='r' already translates them in text mode).

file_old=open('1.php', mode='rU', encoding='gb2312', errors='replace')

This will insert U+FFFD REPLACEMENT CHARACTER in place of anything it can't decode, and continue. This destroys data; first try to figure out what the real encoding of the file is.
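One way to start figuring out the real encoding is to decode the raw bytes strictly with a few candidate codecs and see which ones succeed. The following is only a rough sketch: the function name find_working_encodings, the candidate list, and the sample bytes are made up for illustration, and a codec that decodes without error is not necessarily the right one (the result still has to make sense to a human reader).

```python
def find_working_encodings(raw, candidates=("utf-8", "gb2312", "gbk", "gb18030", "big5")):
    """Return the candidate codecs that decode `raw` without raising an error."""
    working = []
    for name in candidates:
        try:
            raw.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes, so skip it
        working.append(name)
    return working


# Hypothetical sample: the two bytes reported at offsets 99-100 (A9 4C),
# surrounded by plain ASCII. For a real file, read it with open(path, "rb").
sample = b"width: 5" + bytes([0xA9, 0x4C]) + b";"
print(find_working_encodings(sample))
```

If more than one codec survives, print the decoded text for each and compare them by eye before converting anything.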

By the way, don't forget to fix up your HTML header when you're done; the preferred form nowadays is

<!doctype html>
<html><head>
  <meta charset="utf-8">

Concise, standards-compliant, and tested to work all the way back to IE6.

EDIT: On further investigation, GB2312 is a character set, not an encoding. There are several possible encodings of it, but only one allows the two-byte sequence A9 4C: in Big5, it corresponds to the character 呶. (I do not know any of the languages that use Chinese characters; does that make more sense in context than ㎜?)

Python and iconv assume that GB2312 is encoded in a different format, EUC-CN, unless specifically told otherwise. If I modify your script to read

file_old=open('1.php', mode='rU', encoding='big5', errors='strict')
file_new=open('2.php', mode='w', encoding='utf-8')
file_new.write(file_old.read())

then it executes without error on the 1.php you provided.

EDIT 2: On further further investigation, what web browsers do with <meta charset="gb2312"> is pretend you wrote <meta charset="gbk">. GBK is a superset of GB2312 that does include the ㎜ character. Python, however, treats GB2312 per its original definition. So what you really want in order for your conversion to match the original file is

file_old=open('1.php', mode='rU', encoding='gbk', errors='strict')
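Putting that together, here is a small sketch of the whole conversion, assuming the files really are GBK. The helper names convert_to_utf8 and can_encode are invented for this example, and the temporary 1.php is a stand-in built from the A9 4C bytes, not your actual file.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, source_encoding="gbk"):
    """Decode src_path strictly with source_encoding and rewrite it as UTF-8."""
    # Text mode in Python 3 already translates DOS newlines on read.
    with open(src_path, mode="r", encoding=source_encoding, errors="strict") as src:
        text = src.read()
    # newline="\n" keeps the output newlines the same on every platform.
    with open(dst_path, mode="w", encoding="utf-8", newline="\n") as dst:
        dst.write(text)
    return text


def can_encode(char, codec):
    """Report whether `codec` has a byte sequence for `char`."""
    try:
        char.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "1.php")  # stand-in for the original file
    new_path = os.path.join(tmp, "2.php")
    with open(old_path, "wb") as f:
        f.write(b"<p>5" + bytes([0xA9, 0x4C]) + b"</p>\r\n")
    converted = convert_to_utf8(old_path, new_path)
    with open(new_path, "rb") as f:
        utf8_bytes = f.read()

# U+339C is the single-symbol millimetre sign from the question.
print(can_encode("\u339c", "gb2312"), can_encode("\u339c", "gbk"))
print(converted)
```

Because errors='strict' is kept, any file that is not really GBK will still fail loudly instead of being silently damaged.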

Tags: python, python-3.x, encoding, unicode, gb2312

Qiao
