Programming tips with Japanese Language/Characters [closed]

Question

I have an idea for a few web apps to write to help me, and maybe others, learn Japanese better since I am studying the language.

My problem is the site will be in mostly english, so it needs to mix fluently Japanese Characters, usually hirigana and katakana, but later kanji. I am getting closer to accomplishing this; I have figured out that the pages and source files need to be unicode and utf-8 content types.

However, my problem comes in the actual coding. What I need is to manipulate strings of text that are kana. One example is:

けす I need to take that verb and convert it to the te-form けして. I would prefer to do this in javascript as it will help down the road to do more manipulation, but if I have to will just do DB calls and hold everything in a DB.

My question is not only how to do it in javascript, but what are some tips and strategies to doing these kinds of things in other languages, too. I am hoping to get more into doing language learning apps, but am lost when it comes to this.

si28719e · Accepted Answer

Stick to Unicode and utf-8 everywhere.
Stay away from the native Japanese encodings: euc-jp, shiftjis, iso-2022-jp, but be aware that you'll probably encounter them at some point if you continue.
Get familiar with a segmenter for doing complicated stuff like POS analysis, word segmentation, etc. the standard tools used by most people who do NLP (natural language processing) work on Japanese are, in order of popularity/power.

MeCab (originally on SourceForge) is awesome: it allows you to take text like,

「日本語は、とても難しいです。」

and get all sorts of great info back

kettle:~$ echo 日本語は、難しいです | mecab 
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
は   助詞,係助詞,*,*,*,*,は,ハ,ワ
、   記号,読点,*,*,*,*,、,、,、
難しい 形容詞,自立,*,*,形容詞・イ段,基本形,難しい,ムズカシイ,ムズカシイ
です  助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS

which is basically a detailed run-down of the parts-of-speech, readings, pronunciations, etc. It will also do you the favor of analyzing verb tenses,

kettle:~$ echo メキシコ料理が食べたい | mecab 
メキシコ    名詞,固有名詞,地域,国,*,*,メキシコ,メキシコ,メキシコ
料理  名詞,サ変接続,*,*,*,*,料理,リョウリ,リョーリ
が   助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ  動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい  助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS

However, the documentation is all in Japanese, and it's a bit complicated to set up and figure out how to format the output the way you want it. There are packages available for ubuntu/debian, and bindings in a bunch of languages including perl, python, ruby...

Apt-repos for ubuntu:

deb http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
deb-src http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all

Packages to install: $ apt-get install mecab-ipadic-utf8 mecab python-mecab

should do the trick I think.

The other alternatives to mecab are, ChaSen, which was written years ago by the author of MeCab (who incidentally works at google now), and Kakasi, which is much less powerful.

I would definitely try to avoid rolling your own conjugation routines. the problem with this is just that it will require tons and tons of work, which others have already done, and covering all the edge cases with rules is, at the end of the day, impossible.

MeCab is statistically driven, and trained on loads of data. It employs a sophisticated machine learning technique called conditional random fields (CRFs) and the results are really quite good.

Have fun with the Japanese. I'm not sure how good your Japanese is, but if you need help with the docs for mecab or whatever feel free to ask about that as well. Kanji can be quite intimidating at the beginning.

Programming tips with Japanese Language/Characters [closed]

Tags:

javascript

language-agnostic

unicode

nlp

cjk

Buddy Lindsey

1 Answers

si28719e

Recent Activity

Donate For Us

Programming tips with Japanese Language/Characters [closed]

Tags:

javascript

language-agnostic

unicode

nlp

cjk

Buddy Lindsey

1 Answers

si28719e

Related questions

Recent Activity

Donate For Us