I have a file, file1.txt
, containing text in English, Chinese, Japanese, and Korean. For use in ConTeXt, I need to mark each region of text within the file according to language, except for English, and output a new file, e.g., here is a sample line:
The 恐龙 ate 鱼.
As this contains text in Chinese characters, this will get marked like this:
The \language[cn]{恐龙} ate \language[cn]{鱼}.
\language[cn]{*}
.\language[ja]{*}
.\language[ko]{*}
.How can I mark the text according to the language present?
A crude algorithm:
use 5.014;
use utf8;
while (<DATA>) {
s
{(\p{Hangul}+)}
{\\language[ko]{$1}}g;
s
{(\p{Hani}+)}
{\\language[zh]{$1}}g;
s
{(\p{Hiragana}+|\p{Katakana}+)}
{\\language[ja]{$1}}g;
say;
}
__DATA__
The 恐龙 ate 鱼.
The 恐竜 ate 魚.
The キョウリュウ ate うお.
The 공룡 ate 물고기.
(Also see Detect chinese character using perl?)
There are problems with that. Daenyth comments that e.g. 恐竜 is misidentified as Chinese. I find it unlikely that you are really working with mixed English-CJK, and are just giving bad example text. Perform a lexical analysis first to differentiate Chinese from Japanese.
I'd like to provide a Python solution. No matter which language, it is based on Unicode Script information (from Unicode Database, aka UCD). Perl has rather detailed UCD compared to Python.
Python has no Script information opened in its "unicodedata" module. But someone has added it at here https://gist.github.com/2204527 (tiny and useful). My implementaion is based on it. BTW, it is not space sensitive(no need of any lexical analysis).
# coding=utf8
import unicodedata2
text=u"""The恐龙ate鱼.
The 恐竜ate 魚.
Theキョウリュウ ate うお.
The공룡 ate 물고기. """
langs = {
'Han':'cn',
'Katakana':'ja',
'Hiragana':'ja',
'Hangul':'ko'
}
alist = [(x,unicodedata2.script_cat(x)[0]) for x in text]
# Add Last
alist.append(("",""))
newlist = []
langlist = []
prevlang = ""
for raw, lang in alist:
if prevlang in langs and prevlang != lang:
newlist.append("\language[%s]{" % langs[prevlang] +"".join(langlist) + "}")
langlist = []
if lang not in langs:
newlist.append(raw)
else:
langlist.append(raw)
prevlang = lang
newtext = "".join(newlist)
print newtext
The Output is :
$ python test.py
The\language[cn]{恐龙}ate\language[cn]{鱼}.
The \language[cn]{恐竜}ate \language[cn]{魚}.
The\language[ja]{キョウリュウ} ate \language[ja]{うお}.
The\language[ko]{공룡} ate \language[ko]{물고기}.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With