 

How to find Chinese or Japanese characters in a string in Python?

For example, given:

str = 'sdf344asfasf天地方益3権sdfsdf'

I want to wrap each run of Chinese or Japanese characters in parentheses:

strAfterConvert = 'sdf344asfasf(天地方益)3(権)sdfsdf'
Sam asked May 06 '15 07:05


2 Answers

As a start, you can check whether the character is in one of the following Unicode blocks:

  • Unicode Block 'CJK Unified Ideographs' - U+4E00 to U+9FFF
  • Unicode Block 'CJK Unified Ideographs Extension A' - U+3400 to U+4DBF
  • Unicode Block 'CJK Unified Ideographs Extension B' - U+20000 to U+2A6DF
  • Unicode Block 'CJK Unified Ideographs Extension C' - U+2A700 to U+2B73F
  • Unicode Block 'CJK Unified Ideographs Extension D' - U+2B740 to U+2B81F

After that, iterate through the string, collect each run of consecutive Chinese, Japanese or Korean (CJK) characters, and wrap it in parentheses:

# -*- coding:utf-8 -*-
ranges = [
  {"from": ord(u"\u3300"), "to": ord(u"\u33ff")},         # CJK Compatibility
  {"from": ord(u"\ufe30"), "to": ord(u"\ufe4f")},         # CJK Compatibility Forms
  {"from": ord(u"\uf900"), "to": ord(u"\ufaff")},         # CJK Compatibility Ideographs
  {"from": ord(u"\U0002F800"), "to": ord(u"\U0002fa1f")}, # CJK Compatibility Ideographs Supplement
  {"from": ord(u"\u3040"), "to": ord(u"\u309f")},         # Japanese Hiragana
  {"from": ord(u"\u30a0"), "to": ord(u"\u30ff")},         # Japanese Katakana
  {"from": ord(u"\u2e80"), "to": ord(u"\u2eff")},         # CJK Radicals Supplement
  {"from": ord(u"\u4e00"), "to": ord(u"\u9fff")},         # CJK Unified Ideographs
  {"from": ord(u"\u3400"), "to": ord(u"\u4dbf")},         # CJK Unified Ideographs Extension A
  {"from": ord(u"\U00020000"), "to": ord(u"\U0002a6df")}, # CJK Unified Ideographs Extension B
  {"from": ord(u"\U0002a700"), "to": ord(u"\U0002b73f")}, # CJK Unified Ideographs Extension C
  {"from": ord(u"\U0002b740"), "to": ord(u"\U0002b81f")}, # CJK Unified Ideographs Extension D
  {"from": ord(u"\U0002b820"), "to": ord(u"\U0002ceaf")}  # Extension E, included as of Unicode 8.0
]

def is_cjk(char):
  # Use a generator and a name that does not shadow the built-in range.
  return any(r["from"] <= ord(char) <= r["to"] for r in ranges)

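def cjk_substrings(string):
  # Yield each maximal run of consecutive CJK characters, in order of appearance.
  i = 0
  while i < len(string):
    if is_cjk(string[i]):
      start = i
      # The bounds check keeps a run at the very end of the string from raising IndexError.
      while i < len(string) and is_cjk(string[i]): i += 1
      yield string[start:i]
    else:
      i += 1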

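# A unicode literal works on both Python 2 and Python 3.3+.
string = u"sdf344asfasf天地方益3権sdfsdf"
result = string
for sub in cjk_substrings(string):
  # Note: replace() rewrites every occurrence of sub, so a run that appears more than once is wrapped more than once.
  result = result.replace(sub, u"(" + sub + u")")
print(result)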

The above prints

sdf344asfasf(天地方益)3(権)sdfsdf
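
The `replace` loop above rewrites every occurrence of a run each time that run is yielded, so an input in which the same run appears twice (for example `天a天b`) ends up with doubled parentheses. A minimal single-pass sketch, assuming Python 3 and a reduced set of ranges for brevity (the names `CJK_RANGES`, `is_cjk_char` and `wrap_cjk` are illustrative and not part of the code above), groups consecutive characters with `itertools.groupby` and wraps each CJK run exactly once:

```python
from itertools import groupby

# Reduced, illustrative subset of the ranges listed above (inclusive bounds).
CJK_RANGES = [
    (0x3040, 0x309F),    # Hiragana
    (0x30A0, 0x30FF),    # Katakana
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]

def is_cjk_char(char):
    """Return True if the character's code point falls in one of CJK_RANGES."""
    code = ord(char)
    return any(low <= code <= high for low, high in CJK_RANGES)

def wrap_cjk(text):
    """Wrap every maximal run of CJK characters in parentheses, in one pass."""
    pieces = []
    # groupby splits the text into alternating CJK and non-CJK runs.
    for cjk, group in groupby(text, key=is_cjk_char):
        run = "".join(group)
        pieces.append("(" + run + ")" if cjk else run)
    return "".join(pieces)

print(wrap_cjk("sdf344asfasf天地方益3権sdfsdf"))  # sdf344asfasf(天地方益)3(権)sdfsdf
print(wrap_cjk("天a天b"))                          # (天)a(天)b
```

Because the result is rebuilt from the runs rather than patched with `replace`, repeated or overlapping runs do not interfere with each other.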

To be future-proof, keep a lookout for CJK Unified Ideographs Extension E (U+2B820 to U+2CEAF). It will ship with Unicode 8.0, which is scheduled for release in June 2015. I've added it to the ranges, but you shouldn't include it until Unicode 8.0 is released.
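
One hedged way to make that decision at runtime is to look at the version of the Unicode database bundled with the interpreter, which the standard `unicodedata` module exposes as `unidata_version`. The sketch below assumes Python 3 and uses a shortened `ranges` list; `EXTENSION_E` and `unicode_version` are illustrative names, and whether the interpreter's database version is the right criterion for your application is itself an assumption:

```python
import unicodedata

# Shortened, illustrative base list; the full set of ranges is given above.
ranges = [
    {"from": 0x4E00, "to": 0x9FFF},  # CJK Unified Ideographs
    {"from": 0x3400, "to": 0x4DBF},  # CJK Unified Ideographs Extension A
]

EXTENSION_E = {"from": 0x2B820, "to": 0x2CEAF}

def unicode_version():
    """Return the interpreter's Unicode database version as (major, minor)."""
    major, minor = unicodedata.unidata_version.split(".")[:2]
    return int(major), int(minor)

# Only treat Extension E as CJK when the bundled database is Unicode 8.0 or later.
if unicode_version() >= (8, 0):
    ranges.append(EXTENSION_E)

print(unicode_version(), len(ranges))
```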

[EDIT]

Added CJK compatibility ideographs, Japanese Kana and CJK radicals.

EvenLisle answered Sep 21 '22 02:09


You can do this with the third-party regex package, which can match on the Unicode "Script" property of each character and is a drop-in replacement for the standard re module:

import regex as re

pattern = re.compile(r'([\p{IsHan}\p{IsBopo}\p{IsHira}\p{IsKatakana}]+)', re.UNICODE)

# Avoid shadowing the built-in input(); print() with parentheses runs on Python 2 and 3.
text = u'sdf344asfasf天地方益3権sdfsdf'
result = pattern.sub(r'(\1)', text)
print(result)  # Prints: sdf344asfasf(天地方益)3(権)sdfsdf

You should adjust the \p{Is...} sequences to match the character scripts/blocks that you consider to be "Chinese or Japanese".
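
To illustrate what such an adjustment changes, here is a rough sketch that swaps in the standard-library `re` module with hand-written code point ranges instead of `\p{...}` properties, so it runs without the regex package (Python 3). The `SCRIPTS` table and `build_pattern` helper are hypothetical names, and the "han" entry only approximates the Han script with three blocks:

```python
import re

# Hand-written ranges standing in for script properties (approximate, not exhaustive).
SCRIPTS = {
    "han": "\u3400-\u4dbf\u4e00-\u9fff\U00020000-\U0002a6df",
    "hiragana": "\u3040-\u309f",
    "katakana": "\u30a0-\u30ff",
    "bopomofo": "\u3100-\u312f",
}

def build_pattern(*names):
    """Compile a pattern matching runs of characters from the chosen scripts."""
    char_class = "".join(SCRIPTS[name] for name in names)
    return re.compile("([" + char_class + "]+)")

text = "かなsdf天地3権"

# Han only: the kana at the start is left untouched.
print(build_pattern("han").sub(r"(\1)", text))                          # かなsdf(天地)3(権)

# Han plus both kana scripts: the kana run is wrapped as well.
print(build_pattern("han", "hiragana", "katakana").sub(r"(\1)", text))  # (かな)sdf(天地)3(権)
```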

一二三 answered Sep 20 '22 02:09