The following code tests if characters in a string are all Chinese characters. It works for Python 3 but not for Python 2.7. How do I do it in Python 2.7?
for ch in name:
    if ord(ch) < 0x4e00 or ord(ch) > 0x9fff:
        return False
# byte str (you probably get from GAE)
In [1]: s = """Chinese (汉语/漢語 Hànyǔ or 中文 Zhōngwén) is a group of related
language varieties, several of which are not mutually intelligible,"""
# unicode str
In [2]: us = u"""Chinese (汉语/漢語 Hànyǔ or 中文 Zhōngwén) is a group of related
language varieties, several of which are not mutually intelligible,"""
# convert to unicode using str.decode('utf-8')
In [3]: print ''.join(c for c in s.decode('utf-8')
                      if u'\u4e00' <= c <= u'\u9fff')
汉语漢語中文
In [4]: print ''.join(c for c in us if u'\u4e00' <= c <= u'\u9fff')
汉语漢語中文
To make sure all the characters are Chinese, something like this should do:
all(u'\u4e00' <= c <= u'\u9fff' for c in name.decode('utf-8'))
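The one-liner above can be wrapped in a small helper that accepts either a byte string or a unicode string. This is only a sketch: the name is_all_chinese is made up for illustration, UTF-8 input is assumed, and the range covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), not the extension blocks. Since all() over an empty sequence returns True, the sketch rejects empty input explicitly.

```python
def is_all_chinese(name, encoding='utf-8'):
    """Return True if every character of `name` lies in U+4E00..U+9FFF.

    Hypothetical helper; `encoding` is an assumption (UTF-8 by default).
    """
    # In Python 2.7 `bytes` is an alias of `str`, so this check catches
    # byte strings in both Python 2.7 and Python 3.
    if isinstance(name, bytes):
        name = name.decode(encoding)
    if not name:
        # all() on an empty sequence is True; treat empty input as a failure.
        return False
    return all(u'\u4e00' <= c <= u'\u9fff' for c in name)
```

With this, is_all_chinese(u'\u4e2d\u6587') is True, while a string that mixes in spaces or Latin letters gives False. Bytes that are not valid in the given encoding will still raise UnicodeDecodeError.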
In your Python application, use unicode internally: decode as early as possible on input and encode as late as possible on output, creating a "unicode sandwich".
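A rough illustration of the sandwich, assuming UTF-8 at both boundaries. The function name keep_chinese is hypothetical; the point is only that bytes are decoded once on the way in, all logic runs on unicode, and the result is encoded once on the way out.

```python
def keep_chinese(raw_bytes, encoding='utf-8'):
    """Decode early, work on unicode, encode late (hypothetical helper)."""
    # Input boundary: bytes -> unicode.
    text = raw_bytes.decode(encoding)
    # Middle of the sandwich: unicode-only logic.
    kept = u''.join(c for c in text if u'\u4e00' <= c <= u'\u9fff')
    # Output boundary: unicode -> bytes.
    return kept.encode(encoding)
```

Called with the UTF-8 bytes of u'Chinese (\u6c49\u8bed or \u4e2d\u6587)', it returns the UTF-8 bytes of the four Chinese characters only.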
This works fine for me in Python 2.7, provided name is a unicode() value:
>>> ord(u'\u4e00') < 0x4e00
False
>>> ord(u'\u4dff') < 0x4e00
True
You do not have to use ord here if you compare the character directly with unicode values:
>>> u'\u4e00' < u'\u4e00'
False
>>> u'\u4dff' < u'\u4e00'
True
Data from an incoming request will not yet have been decoded to unicode, so you'll need to do that first. Explicitly set the accept-charset attribute on your form tag to ensure that the browser uses the correct encoding:
<form accept-charset="utf-8" action="...">
then decode the data on the server side:
name = self.request.get('name').decode('utf8')
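If the submitted bytes might not be valid UTF-8, or if the framework has already handed back a unicode value, a slightly more defensive decode can help. The sketch below is an assumption-laden illustration, not part of any framework API: decode_name is a made-up helper, and returning None on bad input is just one possible policy.

```python
def decode_name(raw, encoding='utf-8'):
    """Return `raw` as unicode, or None if it cannot be decoded.

    Hypothetical helper; `encoding` is assumed to match the form's
    accept-charset.
    """
    if not isinstance(raw, bytes):
        # Already unicode (in Python 2.7, `bytes` is `str`), nothing to do.
        return raw
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Invalid byte sequence for this encoding.
        return None
```

The caller can then treat a None result as invalid input before running the Chinese-character check.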