Can sorting Japanese kanji words be done programmatically?

I've recently discovered, to my astonishment (having never really thought about it before), that machine-sorting Japanese proper nouns is apparently not possible.

I work on an application that must allow the user to select a hospital from a 3-menu interface. The first menu is Prefecture, the second is City Name, and the third is Hospital. Each menu should be sorted, as you might expect, so the user can find what they want in the menu.

Let me outline what I have found, as preamble to my question:

1. The expected sort order for Japanese words is based on their pronunciation. Kanji do not have an inherent order (there are tens of thousands of kanji in use), but the Japanese phonetic syllabaries do have an order: あ、い、う、え、お、か、き、く、け、こ... and on for the fifty traditional distinct sounds (a few of which are obsolete in modern Japanese). This sort order is called 五十音順 (gojuu on jun, or '50-sound order').
2. Therefore, kanji words should be sorted in the same order as they would be if they were written in hiragana. (You can represent any kanji word in phonetic hiragana in Japanese.)
3. The kicker: there is no canonical way to determine the pronunciation of a given word written in kanji. You never know. Some kanji have ten or more different pronunciations, depending on the word. Many common words are in the dictionary, and I could probably hack together a way to look them up from one of the free dictionary databases, but proper nouns (e.g. hospital names) are not in the dictionary.

So, in my application, I have a list of every prefecture, city, and hospital in Japan. In order to sort these lists, which is a requirement, I need a matching list of each of these names in phonetic form (kana).
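To make the requirement concrete, here is a minimal sketch of what I mean by sorting on the reading rather than on the kanji. The hospital names and the Hospital struct are made up for illustration, and plain string comparison on hiragana only approximates 五十音順 (voiced and small kana, for instance, may not land exactly where a Japanese dictionary would put them).

```ruby
# Hypothetical records: each name is paired with its hiragana reading (yomi).
Hospital = Struct.new(:name, :yomi)

hospitals = [
  Hospital.new("東京病院", "とうきょうびょういん"),
  Hospital.new("青森病院", "あおもりびょういん"),
  Hospital.new("神戸病院", "こうべびょういん"),
]

# Sorting on the kanji just compares code points, which says nothing
# about pronunciation.
by_codepoint = hospitals.sort_by(&:name).map(&:name)

# Sorting on the reading gives the order a user would expect in a menu.
by_reading = hospitals.sort_by(&:yomi).map(&:name)

puts by_codepoint.inspect
puts by_reading.inspect # => ["青森病院", "神戸病院", "東京病院"]
```

The whole problem is therefore where the yomi column comes from.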

I can't come up with anything other than paying somebody fluent in Japanese (I'm only so-so) to manually transcribe them. Before I do so though:

Is it possible that I am totally high on fire, and there actually is some way to do this sorting without creating my own mappings of kanji words to phonetic readings, that I have somehow overlooked?
Is there a publicly available mapping of prefecture/city names, from the government or something? That would reduce the manual mapping I'd need to do to only hospital names.
Does anybody have any other advice on how to approach this problem? Any programming language is fine--I'm working with Ruby on Rails but I would be delighted if I could just write a program that would take the kanji input (say 40,000 proper nouns) and then output the phonetic representations as data that I could import into my Rails app.

宜しくお願いします。

asked Feb 04 '11 07:02

Mason

2 Answers

For data, dig into the data files of Google's Japanese IME (Mozc) here:

https://github.com/google/mozc/tree/master/src/data

There is lots of interesting data there, including IPA dictionaries.

Edit:

You may also try MeCab. It can use the IPA dictionary and can convert kanji to katakana for most words:

https://taku910.github.io/mecab/

There are Ruby bindings for it too:

https://taku910.github.io/mecab/bindings.html

And here is someone who tested Ruby with MeCab, using the tagger with -Oyomi:

http://hirai2.blog129.fc2.com/blog-entry-4.html
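If you would rather not install the bindings, a rough alternative is to shell out to the mecab command from Ruby. This is only a sketch under assumptions: the helper name yomi_for is made up, and it assumes a mecab binary on your PATH with a UTF-8 dictionary installed. The command is a parameter so the helper can be tried with another program when mecab is not available.

```ruby
require "open3"

# Hypothetical helper: pipes text through an external command and returns
# whatever it prints. By default that command is `mecab -Oyomi`, which is
# assumed to be installed with a UTF-8 dictionary.
def yomi_for(text, command: ["mecab", "-Oyomi"])
  output, status = Open3.capture2(*command, stdin_data: text)
  raise "#{command.first} exited with status #{status.exitstatus}" unless status.success?

  # Tag the bytes as UTF-8 regardless of the process locale, then drop the
  # trailing newline.
  output.force_encoding(Encoding::UTF_8).strip
end
```

With mecab installed, yomi_for("東京病院") should come back as a katakana reading. For proper nouns, check the result by hand, since the dictionary may not know the word.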

answered Sep 23 '22 18:09

YOU

Just a quick follow-up to explain the eventual actual solution we used. Thanks to all who recommended mecab--this appears to have done the trick.

We have a mostly-Rails backend, but in our circumstance we didn't need to solve this problem on the backend. For user-entered data, e.g. creating new entities with Japanese names, we modified the UI to require the user to enter the phonetic yomigana in addition to the kanji name. Users seem accustomed to this. The problem was the large corpus of data that is built into the app--hospital, company, and place names, mainly.

So, what we did is:

1. We converted all the source data (a list of 4000 hospitals with name, address, etc.) into .csv format (encoded as UTF-8, of course).
2. Then, for developer use, we wrote a Ruby script that:
   1. Uses mecab to convert the contents of that file into Japanese phonetic readings (the precise command used was mecab -Oyomi -o seed_hospitals.converted.csv seed_hospitals.csv, which outputs a new file with the kanji replaced by the phonetic equivalent, expressed in full-width katakana).
   2. Standardizes all yomikata into hiragana (because users tend to enter hiragana when manually entering yomikata, and hiragana and katakana sort differently). Ruby makes this easy once you find it: NKF.nkf("-h1 -w", katakana_str), where -h1 means convert to hiragana and -w means output UTF-8.
   3. Using the awesomely convenient new Ruby 1.9.2 version of CSV, combines the input file with the mecab-translated file, so that the resulting file now has extra columns inserted, a la NAME, NAME_YOMIGANA, ADDRESS, ADDRESS_YOMIGANA, and so on.
3. We use the data from the resulting .csv file to seed our Rails app with its built-in values.
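For anyone who wants to try something similar, here is a rough sketch of sub-steps 2 and 3. It is not our actual script. The method names to_hiragana and merge_yomigana are made up, String#tr stands in for the NKF call so the example has no extra dependency, and it assumes both files have the same number of rows with the header on the first line.

```ruby
require "csv"

# Map full-width katakana onto hiragana by shifting each character to its
# counterpart. Characters outside the range (kanji, ASCII, the long-vowel
# mark ー) pass through unchanged.
def to_hiragana(str)
  str.tr("ァ-ヶ", "ぁ-ゖ")
end

# Interleave every original column with a <COLUMN>_YOMIGANA column taken
# from the mecab-converted copy of the same CSV.
def merge_yomigana(original_csv, converted_csv)
  header, *rows = CSV.parse(original_csv)
  _converted_header, *yomi_rows = CSV.parse(converted_csv)
  raise ArgumentError, "row counts differ" unless rows.size == yomi_rows.size

  CSV.generate do |out|
    out << header.flat_map { |col| [col, "#{col}_YOMIGANA"] }
    rows.zip(yomi_rows).each do |row, yomi_row|
      out << row.zip(yomi_row).flat_map do |value, reading|
        [value, to_hiragana(reading.to_s)]
      end
    end
  end
end
```

Called with the contents of seed_hospitals.csv and seed_hospitals.converted.csv, merge_yomigana returns the merged CSV as a string, ready to write out and feed to the seed task.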

From time to time the client updates the source data, so we will need to do this whenever that happens.

As far as I can tell, this output is good. My Japanese isn't good enough to be 100% sure, but a few of my Japanese coworkers skimmed it and said it looks all right. I put a slightly obfuscated sample of the converted addresses in this gist so that anybody who cared to read this far can see for themselves.

UPDATE: The results are in... it's pretty good, but not perfect. Still, it looks like it correctly phoneticized 95%+ of the quasi-random addresses in my list.
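One cheap check for part of the remaining few percent is to flag any reading that still contains kanji. This is a sketch with made-up sample rows and a made-up method name. It assumes that words mecab does not recognize pass through unconverted, which is my understanding of the default dictionary settings. It will not catch names that were converted to the wrong reading, so those still need a human.

```ruby
# Returns true when a reading still contains kanji (Han script characters).
def unconverted_kanji?(reading)
  reading.match?(/\p{Han}/)
end

# Hypothetical [name, reading] pairs, as they might look after conversion.
rows = [
  ["東京病院", "とうきょうびょういん"],
  ["山田医院", "山田いいん"],
]

# Keep only the rows that need a manual look.
flagged = rows.select { |_name, reading| unconverted_kanji?(reading) }
flagged.each { |name, reading| puts "needs review: #{name} (#{reading})" }
```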

Many thanks to all who helped me!

answered Sep 22 '22 18:09

Mason
