 

How to mark all CJK text in a document?

I have a file, file1.txt, containing text in English, Chinese, Japanese, and Korean. For use in ConTeXt, I need to mark each region of text within the file according to its language, except for English, and output a new file. For example, here is a sample line:

The 恐龙 ate 鱼.

Since this line contains Chinese characters, it should be marked like this:

The \language[cn]{恐龙} ate \language[cn]{鱼}.

  • The document is saved as UTF-8.
  • Text in Chinese should be marked \language[cn]{*}.
  • Text in Japanese should be marked \language[ja]{*}.
  • Text in Korean should be marked \language[ko]{*}.
  • The content never continues from one line to the next.
  • If the code is ever in doubt whether a character is Chinese, Japanese, or Korean, it should default to Chinese.

How can I mark the text according to the language present?

asked May 07 '12 by Village

2 Answers

A crude algorithm:

use 5.014;
use utf8;
use open qw(:std :utf8);   # emit UTF-8 so say() doesn't warn about wide characters

while (<DATA>) {
    # Korean: runs of Hangul
    s{(\p{Hangul}+)}
     {\\language[ko]{$1}}g;
    # Han ideographs: default to Chinese, as the question requires
    s{(\p{Hani}+)}
     {\\language[cn]{$1}}g;
    # Japanese: runs of Hiragana or Katakana
    s{(\p{Hiragana}+|\p{Katakana}+)}
     {\\language[ja]{$1}}g;
    say;
}

__DATA__
The 恐龙 ate 鱼.
The 恐竜 ate 魚.
The キョウリュウ ate うお.
The 공룡 ate 물고기.
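
Run against the sample data, this should print:

The \language[cn]{恐龙} ate \language[cn]{鱼}.
The \language[cn]{恐竜} ate \language[cn]{魚}.
The \language[ja]{キョウリュウ} ate \language[ja]{うお}.
The \language[ko]{공룡} ate \language[ko]{물고기}.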

(Also see Detect chinese character using perl?)

There are problems with that: as Daenyth comments, 恐竜 (Japanese) is misidentified as Chinese. I find it unlikely that you are really working with such mixed English-CJK text and suspect you are just giving a bad example; if you are, perform a lexical analysis first to differentiate Chinese from Japanese.
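
Short of full lexical analysis, one cheap improvement is to let kana on the same line vote for Japanese before falling back to the Chinese default. A minimal sketch in Python (my own illustration, not part of the code above; the regex ranges are approximations covering only the basic kana and Han blocks):

import re

KANA = re.compile(u'[\u3040-\u30FF]')    # Hiragana + Katakana blocks
HAN  = re.compile(u'[\u4E00-\u9FFF]+')   # CJK Unified Ideographs (BMP only)

def mark_han(line):
    # Default to Chinese, as the question requires, unless the line contains kana.
    tag = 'ja' if KANA.search(line) else 'cn'
    return HAN.sub(lambda m: u'\\language[%s]{%s}' % (tag, m.group(0)), line)

print(mark_han(u'The 恐竜 ate 魚.'))    # no kana on the line: still marked cn
print(mark_han(u'恐竜は魚を食べた.'))    # kana present: Han runs marked ja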

answered by daxim

I'd like to provide a Python solution. Regardless of the implementation language, the approach is based on Unicode script information (from the Unicode Character Database, aka UCD). Perl has rather detailed UCD support compared to Python: Python does not expose script information in its "unicodedata" module, but someone has added it here: https://gist.github.com/2204527 (tiny and useful). My implementation is based on it. By the way, it is not whitespace-sensitive (no lexical analysis is needed).

    # coding=utf8
    import unicodedata2   # stdlib unicodedata plus script_cat() from the gist

    text = u"""The恐龙ate鱼.
    The 恐竜ate 魚.
    Theキョウリュウ ate うお.
    The공룡 ate 물고기. """

    # Map Unicode script names to ConTeXt language tags.
    langs = {
        'Han': 'cn',
        'Katakana': 'ja',
        'Hiragana': 'ja',
        'Hangul': 'ko',
    }

    # Pair every character with its Unicode script name.
    alist = [(x, unicodedata2.script_cat(x)[0]) for x in text]
    # Sentinel entry so the final run gets flushed.
    alist.append(("", ""))

    newlist = []
    langlist = []
    prevlang = ""
    for raw, lang in alist:
        # A run of CJK characters ends here: wrap it and emit it.
        if prevlang in langs and prevlang != lang:
            newlist.append("\\language[%s]{" % langs[prevlang] + "".join(langlist) + "}")
            langlist = []

        if lang not in langs:
            newlist.append(raw)       # non-CJK text passes through unchanged
        else:
            langlist.append(raw)      # accumulate the current CJK run
        prevlang = lang

    newtext = "".join(newlist)
    print(newtext)

The output is:

    $ python test.py 
    The\language[cn]{恐龙}ate\language[cn]{鱼}.
    The \language[cn]{恐竜}ate \language[cn]{魚}.
    The\language[ja]{キョウリュウ} ate \language[ja]{うお}.
    The\language[ko]{공룡} ate \language[ko]{물고기}.
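
For what it's worth, the same run-grouping can also be written with itertools.groupby. This is a sketch of mine, not part of the answer above: it replaces the gist's script_cat() with rough codepoint ranges, so treat the ranges as an approximation:

    from itertools import groupby

    def script(ch):
        # Approximate script detection by codepoint range (BMP only).
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF: return 'ja'   # Hiragana + Katakana
        if 0x4E00 <= cp <= 0x9FFF: return 'cn'   # Han ideographs: Chinese by default
        if 0xAC00 <= cp <= 0xD7A3: return 'ko'   # Hangul syllables
        return None                              # everything else passes through

    def mark(text):
        out = []
        for lang, run in groupby(text, key=script):
            chunk = u''.join(run)
            out.append(chunk if lang is None else u'\\language[%s]{%s}' % (lang, chunk))
        return u''.join(out)

    print(mark(u'The 공룡 ate 물고기.'))
    # prints: The \language[ko]{공룡} ate \language[ko]{물고기}.
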
answered by wuliang