I have an idea for a few web apps to write to help me, and maybe others, learn Japanese better since I am studying the language.
My problem is the site will be in mostly english, so it needs to mix fluently Japanese Characters, usually hirigana and katakana, but later kanji. I am getting closer to accomplishing this; I have figured out that the pages and source files need to be unicode and utf-8 content types.
However, my problem comes in the actual coding. What I need is to manipulate strings of text that are kana. One example is:
けす I need to take that verb and convert it to the te-form けして. I would prefer to do this in javascript as it will help down the road to do more manipulation, but if I have to will just do DB calls and hold everything in a DB.
My question is not only how to do it in javascript, but what are some tips and strategies to doing these kinds of things in other languages, too. I am hoping to get more into doing language learning apps, but am lost when it comes to this.
MeCab (originally on SourceForge) is awesome: it allows you to take text like,
「日本語は、とても難しいです。」
and get all sorts of great info back
kettle:~$ echo 日本語は、難しいです | mecab
日本語 名詞,一般,*,*,*,*,日本語,ニホンゴ,ニホンゴ
は 助詞,係助詞,*,*,*,*,は,ハ,ワ
、 記号,読点,*,*,*,*,、,、,、
難しい 形容詞,自立,*,*,形容詞・イ段,基本形,難しい,ムズカシイ,ムズカシイ
です 助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
EOS
which is basically a detailed run-down of the parts-of-speech, readings, pronunciations, etc. It will also do you the favor of analyzing verb tenses,
kettle:~$ echo メキシコ料理が食べたい | mecab
メキシコ 名詞,固有名詞,地域,国,*,*,メキシコ,メキシコ,メキシコ
料理 名詞,サ変接続,*,*,*,*,料理,リョウリ,リョーリ
が 助詞,格助詞,一般,*,*,*,が,ガ,ガ
食べ 動詞,自立,*,*,一段,連用形,食べる,タベ,タベ
たい 助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
EOS
However, the documentation is all in Japanese, and it's a bit complicated to set up and figure out how to format the output the way you want it. There are packages available for ubuntu/debian, and bindings in a bunch of languages including perl, python, ruby...
Apt-repos for ubuntu:
deb http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
deb-src http://cl.naist.jp/~eric-n/ubuntu-nlp intrepid all
Packages to install:
$ apt-get install mecab-ipadic-utf8 mecab python-mecab
should do the trick I think.
The other alternatives to mecab are, ChaSen, which was written years ago by the author of MeCab (who incidentally works at google now), and Kakasi, which is much less powerful.
I would definitely try to avoid rolling your own conjugation routines. the problem with this is just that it will require tons and tons of work, which others have already done, and covering all the edge cases with rules is, at the end of the day, impossible.
MeCab is statistically driven, and trained on loads of data. It employs a sophisticated machine learning technique called conditional random fields (CRFs) and the results are really quite good.
Have fun with the Japanese. I'm not sure how good your Japanese is, but if you need help with the docs for mecab or whatever feel free to ask about that as well. Kanji can be quite intimidating at the beginning.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With