
How to check if a Chinese character is simplified or traditional in Python 3?

Is there a way to check whether a Chinese character is simplified or traditional in Python 3?

Shudong asked Sep 12 '15 17:09




2 Answers

cjklib does not support Python 3. In Python 3, you can use hanzidentifier.

import hanzidentifier

print(hanzidentifier.has_chinese('Hello my name is John.'))
# False

print(hanzidentifier.has_chinese('Country in Simplified: 国家. Country in Traditional: 國家.'))
# True

print(hanzidentifier.is_simplified('John说:你好!'))
# True

print(hanzidentifier.is_traditional('John說:你好!'))
# True
Hong Zher Tan answered Oct 05 '22 01:10


You can use getCharacterVariants() in cjklib to query the character's simplified (S) and traditional (T) variants. As described in the Unihan database documentation, you can use this data to determine the classification for a character.
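
As a rough illustration of that classification rule, here is a minimal sketch that does not depend on cjklib. The two small dictionaries are hand-typed stand-ins for the Unihan kSimplifiedVariant and kTraditionalVariant fields, and classify_char is a hypothetical helper name. A real program would load the full Unihan data instead of this tiny excerpt.

```python
# Hand-typed excerpt standing in for Unihan variant data (assumption:
# a real program would load the complete kSimplifiedVariant and
# kTraditionalVariant fields from the Unihan database).
SIMPLIFIED_VARIANT = {"國": "国", "說": "说", "書": "书"}   # traditional -> simplified
TRADITIONAL_VARIANT = {"国": "國", "说": "說", "书": "書"}  # simplified -> traditional


def classify_char(ch):
    """Classify one character using the variant tables above."""
    # Only look at the basic CJK Unified Ideographs block in this sketch.
    if not ("\u4e00" <= ch <= "\u9fff"):
        return "unknown"
    has_simplified_variant = ch in SIMPLIFIED_VARIANT
    has_traditional_variant = ch in TRADITIONAL_VARIANT
    if has_simplified_variant and has_traditional_variant:
        # Listed in both tables: the character cannot be assigned to one set.
        return "ambiguous"
    if has_simplified_variant:
        # It has a simpler form, so it is itself a traditional character.
        return "traditional"
    if has_traditional_variant:
        # It has a traditional form, so it is itself a simplified character.
        return "simplified"
    # No variants listed: the same form is used in both writing systems.
    return "both"


for ch in "国國你A":
    print(ch, classify_char(ch))
```

With only three pairs in the tables, any character outside that excerpt is reported as "both", so the result is only as reliable as the variant data you load.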

一二三 answered Oct 05 '22 01:10