
How to filter Chinese (ONLY Chinese)

Tags:

python

I want to convert text that includes punctuation and full-width symbols into pure Chinese text, keeping only the Chinese characters.

import re

maybe_re = re.compile("xxxxxxxxxxxxxxxxx")  # TODO: a pattern that matches only Chinese characters
print("".join(maybe_re.findall("你好,这只是一些中文文本..,.,全角")))

# Desired output:
# 你好这只是一些中文文本全角
Dreampuf asked Aug 02 '11 10:08



2 Answers

I don't know of any good way to separate Chinese characters from other letters, but you can distinguish letters from other characters. Using regexes, you can use r"\w" (compiled with the re.UNICODE flag if you're on Python 2). That will include numbers as well as letters, but not punctuation.
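As a rough sketch of the regex route (written for Python 3, where `\w` on a `str` pattern is already Unicode-aware; the variable names are only illustrative):

```python
import re

text = "你好,这只是一些中文文本..,.,全角"

# \w matches letters, digits and underscore, but not punctuation.
# On Python 2 you would compile with re.UNICODE and use unicode strings.
word_chars = "".join(re.findall(r"\w+", text))
print(word_chars)

# Caveat: Latin letters and digits survive too.
mixed = "".join(re.findall(r"\w+", "abc 你好, 123!"))
print(mixed)
```

This strips the punctuation from the sample sentence, but as the second call shows, it does not restrict the result to Chinese.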

unicodedata.category(c) will tell you what type of character c is. Your Chinese letters are "Lo" (letter without case), while the punctuation is "Po".
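A minimal sketch of filtering on that category, assuming Python 3 (`keep_caseless_letters` is a made-up helper name, not part of `unicodedata`):

```python
import unicodedata

def keep_caseless_letters(text):
    """Keep only characters whose Unicode general category is 'Lo'."""
    return "".join(c for c in text if unicodedata.category(c) == "Lo")

print(keep_caseless_letters("你好,这只是一些中文文本..,.,全角"))
# ASCII letters are 'Lu'/'Ll' and digits are 'Nd', so they are dropped as well.
print(keep_caseless_letters("abc你好123"))
```

Note that "Lo" is not specific to Chinese: other caseless scripts (Japanese kana, Hebrew, Arabic and so on) fall in the same category, so this only helps when the input is known to mix Chinese with punctuation, digits and Latin text.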

Thomas K answered Nov 13 '22 14:11



The Zhon library provides you with a list of Chinese punctuation marks: https://pypi.python.org/pypi/zhon

import re, zhon.unicode
cleaned = re.sub('[%s]' % zhon.unicode.PUNCTUATION, "", "你好,这只是一些中文文本..,.,全角")

This does almost what you want, but not exactly: the sentence you provide contains some very non-standard punctuation marks, such as ".", which are left in place. Anyway, I think Zhon might be useful to others with a similar issue.
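If a third-party dependency is not an option, one possible alternative is to whitelist a code point range instead of blacklisting punctuation. The sketch below is Python 3 and assumes "Chinese" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); `HAN_RE` and `only_han` are hypothetical names, and the extension blocks are deliberately not covered.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block, U+4E00..U+9FFF.
# Rare characters in the extension blocks would be dropped by this pattern.
HAN_RE = re.compile(r"[\u4e00-\u9fff]+")

def only_han(text):
    """Return the characters of text that fall inside the assumed range."""
    return "".join(HAN_RE.findall(text))

print(only_han("你好,这只是一些中文文本..,.,全角abc123"))
```

Because this keeps what matches rather than removing what doesn't, unusual punctuation, Latin letters and digits all disappear without having to be listed.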

Régis B. answered Nov 13 '22 14:11
