
Word wrap algorithms for Japanese

In a recent web application I built, I was pleasantly surprised when one of our users decided to use it to create something entirely in Japanese. However, the text wrapped strangely and awkwardly. Apparently browsers don't cope well with wrapping Japanese text, probably because it contains few spaces. One could assume that each character forms a whole word and that a line can therefore break anywhere, but that isn't a safe assumption: some words consist of several characters, and some character groups should not be split across lines.

Googling around hasn't really helped me understand the problem any better. It seems to me that one would need a dictionary of unbreakable patterns and could assume that everywhere else is safe to break. But I fear I don't know enough about Japanese to know all the words, which, from what my searching turned up, are quite complicated.

How would you approach this problem? Are there any libraries or algorithms you are aware of that already exist that deal with this in a satisfactory way?

Breton asked Jan 19 '10 00:01



1 Answer

Japanese word wrap rules are called kinsoku shori and are surprisingly simple. They're actually mostly concerned with punctuation characters and do not try to keep words unbroken at all.

I just checked with a Japanese novel and indeed, both words in the syllabic kana script and those consisting of multiple Chinese ideograms are wrapped mid-word with impunity.
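
To make the idea concrete, here is a minimal sketch of a punctuation-only line breaker in Python. It is not taken from any library: the function name `wrap_kinsoku` and the two character sets are hypothetical, the sets are a small illustrative subset of what real kinsoku tables contain, and the width is counted in characters rather than rendered pixels. When a tentative break would violate a rule, the sketch simply moves the break earlier, which is only one of several ways to resolve such a conflict.

```python
# Illustrative, incomplete character sets (assumption: real kinsoku tables are longer).
# Characters that should not begin a line: closing brackets, punctuation, small kana.
NO_LINE_START = set("、。，．・：；？！）］｝」』】〉》ーぁぃぅぇぉっゃゅょァィゥェォッャュョ")
# Characters that should not end a line: opening brackets.
NO_LINE_END = set("（［｛「『【〈《")


def violates(text, pos):
    """Return True if breaking between text[pos - 1] and text[pos] breaks a rule."""
    return text[pos] in NO_LINE_START or text[pos - 1] in NO_LINE_END


def wrap_kinsoku(text, width):
    """Greedily wrap text at `width` characters, shifting forbidden break points."""
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width  # tentative break: text[start:end] | text[end:]
        # Move the break earlier until it no longer violates a rule,
        # always keeping at least one character on the current line.
        while end > start + 1 and violates(text, end):
            end -= 1
        if violates(text, end):
            # No legal break found on this line; fall back to a hard break.
            end = start + width
        lines.append(text[start:end])
        start = end
    if start < len(text):
        lines.append(text[start:])
    return lines
```

Calling `wrap_kinsoku("これは「日本語」のテストです。", 4)` never leaves 「 at the end of a line or puts 」 or 。 at the start of one, yet it happily splits 日本語 and テスト across lines, which matches the behaviour described above.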

Michael Borgwardt answered Oct 23 '22 04:10
