I have a file with Korean and Chinese characters. I want to find pairs where parenthetical statements are used to give the hanja for a Korean word, like this: 한문 (漢文)
The search would look something like this: /[korean characters] \([chinese characters]\)/
How do I specify the Chinese or Korean characters, or any other set such as Cyrillic or Thai for example?
Unicode provides properties that identify which script a character belongs to. Characters can be matched based on their script property using \p{Script=...}.
I don't know much about the languages you mentioned, but I think you want:
\p{Script=Han} (aka \p{Han}) for Chinese.
\p{Script=Hangul} (aka \p{Hangul}) for Korean.
\p{Script=Cyrillic} (aka \p{Cyrl}) for Cyrillic.
\p{Script=Thai} (aka \p{Thai}) for Thai.
You could take a look at perluniprops to find the one you are looking for, or you could use uniprops* to find which properties match a specific character.
$ uniprops D55C
U+D55C ‹한› \N{HANGUL SYLLABLE HAN}
\w \pL \p{L_} \p{Lo}
All Any Alnum Alpha Alphabetic Assigned InHangulSyllables L Lo
Gr_Base Grapheme_Base Graph GrBase Hang Hangul Hangul_Syllables
ID_Continue IDC ID_Start IDS Letter L_ Other_Letter Print Word
XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha
X_POSIX_Graph X_POSIX_Print X_POSIX_Word
To find out which characters are in a given property, you can use unichars*. (This is of limited usefulness since most CJK chars aren't named.)
$ unichars -au '\p{Han}'
⺀ U+2E80 CJK RADICAL REPEAT
⺁ U+2E81 CJK RADICAL CLIFF
⺂ U+2E82 CJK RADICAL SECOND ONE
⺃ U+2E83 CJK RADICAL SECOND TWO
⺄ U+2E84 CJK RADICAL SECOND THREE
...
$ unichars -au '\p{Hangul}'
ᄀ U+01100 HANGUL CHOSEONG KIYEOK
ᄁ U+01101 HANGUL CHOSEONG SSANGKIYEOK
ᄂ U+01102 HANGUL CHOSEONG NIEUN
ᄃ U+01103 HANGUL CHOSEONG TIKEUT
ᄄ U+01104 HANGUL CHOSEONG SSANGTIKEUT
...
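If Unicode::Tussle isn't installed, a rough stand-in for the listings above can be sketched in plain Perl. This is not how unichars itself works; first_in_property is a hypothetical helper that walks code points in order and keeps the first few matching a property, using the core charnames::viacode function to look up names.

```perl
use strict;
use warnings;
use charnames ();                        # core module, for charnames::viacode
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not part of Unicode::Tussle): returns up to $limit
# lines, one per code point matching $property_re, in code point order.
sub first_in_property {
    my ($property_re, $limit) = @_;
    my @lines;
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;   # skip surrogates
        my $char = chr $cp;
        next unless $char =~ $property_re;
        my $name = charnames::viacode($cp) // '(no name)';
        push @lines, sprintf '%s U+%04X %s', $char, $cp, $name;
        last if @lines >= $limit;
    }
    return @lines;
}

print "$_\n" for first_in_property(qr/\p{Script=Han}/, 5);
print "$_\n" for first_in_property(qr/\p{Script=Hangul}/, 5);
```

The explicit \p{Script=...} form is used here so the result depends only on the Script property; the short forms may be resolved slightly differently depending on the Perl version.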
* uniprops and unichars are available from the Unicode::Tussle distro.
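To tie this back to the original question, here is a minimal sketch of the search using the script properties. The helper name find_hanja_glosses and the sample string are made up for illustration, and allowing optional whitespace before the parenthesis (rather than exactly one space) is my assumption about what you want.

```perl
use strict;
use warnings;
use utf8;                                # this source contains literal UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: returns [hangul, hanja] pairs such as ['한문', '漢文'].
# Assumes $text is already decoded into Perl characters, not raw bytes.
sub find_hanja_glosses {
    my ($text) = @_;
    my @pairs;
    # One or more Hangul characters, optional whitespace, then a
    # parenthesized run consisting only of Han characters.
    while ($text =~ /(\p{Script=Hangul}+)\s*\((\p{Script=Han}+)\)/g) {
        push @pairs, [$1, $2];
    }
    return @pairs;
}

my $sample = '한문 (漢文) is written in 한자 (漢字), not in Latin (abc).';
for my $pair (find_hanja_glosses($sample)) {
    print "$pair->[0] => $pair->[1]\n";
}
```

When reading from a file, the handle needs a decoding layer (for example opening it with '<:encoding(UTF-8)') so that the properties are tested against characters rather than bytes.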