
Perl regex find character from arbitrary set

Tags: regex, perl, cjk

I have a file with Korean and Chinese characters. I want to find pairs where a parenthetical gives the hanja for a Korean word, like this: 한문 (漢文)

The search would look something like this: /[korean characters] \([chinese characters]\)/

How do I specify the Chinese or Korean characters, or any other set such as Cyrillic or Thai for example?

Nate Glenn asked Jan 24 '12 00:01


1 Answer

Unicode assigns every character a Script property that identifies the script it belongs to. In a Perl regex, characters can be matched by script using \p{Script=...}.

I don't know much about the languages you mentioned, but I think you want

  • \p{Script=Han} aka \p{Han} for Chinese.
  • \p{Script=Hangul} aka \p{Hangul} for Korean.
  • \p{Script=Cyrillic} aka \p{Cyrl} for Cyrillic.
  • \p{Script=Thai} aka \p{Thai} for Thai.
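Applied to the pattern from the question, these properties could be used roughly as follows. This is a minimal sketch: the sample string is made up for illustration, and it assumes a single space between the Korean word and the opening parenthesis, as in the question's example.

```perl
use strict;
use warnings;
use utf8;                               # this source contains literal UTF-8 text

binmode(STDOUT, ':encoding(UTF-8)');    # encode output as UTF-8

# Hypothetical sample text. When reading a real file, open it with
# an ':encoding(UTF-8)' layer so the regex sees decoded characters.
my $text = "한문 (漢文) is written in 한자 (漢字), not in English (abc).";

# One or more Hangul characters, a space, then one or more
# Han characters enclosed in literal parentheses.
my @pairs;
while ($text =~ /(\p{Hangul}+) \((\p{Han}+)\)/g) {
    push @pairs, [$1, $2];
}

print "$_->[0] => $_->[1]\n" for @pairs;
```

The "English (abc)" parenthetical is skipped because neither side belongs to the requested scripts.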

You could take a look at perluniprops to find the one you are looking for, or you could use uniprops* to find which properties match a specific character.

$ uniprops D55C
U+D55C ‹한› \N{HANGUL SYLLABLE HAN}
    \w \pL \p{L_} \p{Lo}
    All Any Alnum Alpha Alphabetic Assigned InHangulSyllables L Lo
    Gr_Base Grapheme_Base Graph GrBase Hang Hangul Hangul_Syllables
    ID_Continue IDC ID_Start IDS Letter L_ Other_Letter Print Word
    XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha
    X_POSIX_Graph X_POSIX_Print X_POSIX_Word
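If installing uniprops is not convenient, the script of a character can also be looked up from Perl itself. The sketch below uses charscript from the core Unicode::UCD module and charnames::viacode; the four code points are arbitrary examples chosen for illustration (한, 漢, Ж, ก), not anything from the question's file.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);   # core module: Unicode database lookups
use charnames ();                  # for charnames::viacode

binmode(STDOUT, ':encoding(UTF-8)');

# Example code points (assumed for illustration):
# Hangul syllable, Han ideograph, Cyrillic letter, Thai letter.
my %script_of;
for my $cp (0xD55C, 0x6F22, 0x0416, 0x0E01) {
    $script_of{$cp} = charscript($cp);
    printf "U+%04X %s  %s  script=%s\n",
        $cp, chr($cp),
        charnames::viacode($cp) // '(unnamed)',
        $script_of{$cp};
}
```

The name it reports for a script is the same one that goes inside \p{Script=...}.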

To find out which characters are in a given property, you can use unichars*. (This is of limited usefulness since most CJK chars aren't named.)

$ unichars -au '\p{Han}'
 ⺀ U+2E80 CJK RADICAL REPEAT
 ⺁ U+2E81 CJK RADICAL CLIFF
 ⺂ U+2E82 CJK RADICAL SECOND ONE
 ⺃ U+2E83 CJK RADICAL SECOND TWO
 ⺄ U+2E84 CJK RADICAL SECOND THREE
...

$ unichars -au '\p{Hangul}'
 ᄀ U+01100 HANGUL CHOSEONG KIYEOK
 ᄁ U+01101 HANGUL CHOSEONG SSANGKIYEOK
 ᄂ U+01102 HANGUL CHOSEONG NIEUN
 ᄃ U+01103 HANGUL CHOSEONG TIKEUT
 ᄄ U+01104 HANGUL CHOSEONG SSANGTIKEUT
...
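A similar listing can be approximated in plain Perl by testing each code point against the property. This is only a sketch of the idea, not how unichars is implemented; the scanned range (U+1100 to U+11FF, the Hangul Jamo block) and the five-entry limit are assumptions made to keep the output short.

```perl
use strict;
use warnings;
use charnames ();

binmode(STDOUT, ':encoding(UTF-8)');

# Collect the first five code points in the assumed range
# whose character matches \p{Hangul}.
my @found;
for my $cp (0x1100 .. 0x11FF) {
    next unless chr($cp) =~ /\p{Hangul}/;
    push @found, sprintf("%s U+%04X %s",
        chr($cp), $cp, charnames::viacode($cp) // '(unnamed)');
    last if @found == 5;
}

print "$_\n" for @found;
```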

* — uniprops and unichars are available from the Unicode::Tussle distro.

ikegami answered Nov 15 '22 10:11