
Prevent browser converting '\n' between lines into space (for Chinese characters)

Tags: html, browser, cjk

Converting a newline into a space makes sense for English. For example, take the following HTML:

<p>
This is
a sentence.
</p>

After the browser converts the newline into a space, we get the following:

This is a sentence.

This is good for English, but not for Chinese, because Chinese does not use spaces to separate words. Here's an example (the Chinese sentence means the same as "This is a sentence"):

<p>
这是
一句话。
</p>

I get the following result on Chrome, Safari and IE...

这是 一句话。

...but what I wanted is the following, without the extra space:

这是一句话。

I don't know why the browser does not ignore the newline when the last character of the current line and the first character of the next line are both Chinese characters (which I think would make more sense). Or do browsers provide such a mechanism, but it needs special handling?

BTW, in Vim, when using "J" to join lines, no space is added if the last character of the first line and the first character of the second line are both Chinese characters. For English, a space is added. So I guess Vim does some special handling for this.

UPDATE:

Though I think this is an issue with the browser, I have to live with it. So for now I preprocess my Markdown text to join Chinese lines before generating HTML. Here's how I do this in Ruby; the complete code, which also handles Chinese punctuation, is on gist.

# encoding: UTF-8

# Requires Ruby 1.9.x and assumes UTF-8 encoding

class String
  # The regular expression trick to match CJK characters comes from
  # http://stackoverflow.com/a/4681577/306935
  def join_chinese
    # Lookarounds keep the Han characters out of the match, so consecutive
    # one-character lines are joined as well.
    gsub(/(?<=\p{Han})\n(?=\p{Han})/, '')
  end
end
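
The gist itself is not reproduced here. As a rough sketch of what handling punctuation could look like, the helper below also treats CJK punctuation as "Chinese" when deciding whether to drop a newline. The method name join_cjk_lines and the choice of Unicode ranges (U+3000 to U+303F and U+FF00 to U+FFEF) are assumptions made for illustration, not taken from the gist.

```ruby
# encoding: UTF-8

# Hypothetical helper, not the code from the gist. A newline is removed only
# when the characters on both sides are Han ideographs or fall in the assumed
# CJK punctuation ranges (U+3000-U+303F, U+FF00-U+FFEF).
def join_cjk_lines(text)
  # Lookbehind/lookahead test the neighbours without consuming them.
  text.gsub(
    /(?<=[\p{Han}\u3000-\u303F\uFF00-\uFFEF])\n(?=[\p{Han}\u3000-\u303F\uFF00-\uFFEF])/,
    ''
  )
end

puts join_cjk_lines("这是\n一句话。")          # Chinese lines are joined
puts join_cjk_lines("你好，\n世界")            # line ending in a fullwidth comma is joined
puts join_cjk_lines("This is\na sentence.")   # English newline is left alone
```

Lines that mix scripts at the break (Chinese on one side, Latin on the other) keep their newline, so the browser still renders a space there.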

asked Dec 18 '11 06:12 by cyfdecyf


2 Answers

Browsers treat newlines as spaces because the specifications say so, and have said so ever since HTML 2.0. In fact, HTML 2.0 was milder than later specifications; it said: “An HTML user agent should treat end of line in any of its variations as a word space in all contexts except preformatted text.” (Conventional Representation of Newlines), whereas newer specifications state it more strongly (describing it as what happens in HTML).
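
To make the rule concrete, here is a small Ruby sketch that mimics the effect on the two paragraphs from the question. It is a simplified model only: the method name collapse_whitespace is made up for illustration, and real browsers follow the more detailed white-space processing rules of the specifications.

```ruby
# encoding: UTF-8

# Simplified model of text rendering outside preformatted content: every run
# of spaces, tabs, and line breaks becomes a single space. This illustrates
# the rule quoted above; it is not the algorithm a browser implements.
def collapse_whitespace(text)
  text.gsub(/[ \t\r\n\f]+/, ' ').strip
end

puts collapse_whitespace("\nThis is\na sentence.\n")  # prints "This is a sentence."
puts collapse_whitespace("\n这是\n一句话。\n")          # prints "这是 一句话。"
```

The rule looks only at the whitespace itself, never at the characters around it, which is why the Chinese sentence ends up with the unwanted space.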

The background is that HTML and the Web were developed mainly with Western European languages in mind; this is reflected in many features of the original specifications and early implementations. Only slowly have they been internationalized.

It is unlikely that the parsing rules will be changed. What might happen instead is rendering that is sensitive to language or character properties. A line break would still be taken as a space (and the DOM string would contain an ASCII space character), but a string like 这是 一句话。 would be rendered as if the space were not there. This is what the HTML 4.01 specification seems to refer to (White space). The text is somewhat confused, but I think it tries to say that the behavior would depend on the content language, either inferred by the browser or declared in markup.

But browsers don’t do such things yet. Declaring the language of content, e.g. <html lang=zh>, is a good principle but has little practical impact—in rendering, it may affect the browser’s choice of a default font (but how many authors let browsers use their default fonts?). It may even result in added spacing, if the space character happens to be wider in the browser’s default font for the language specified.

According to the CSS3 Text draft, you could use the text-spacing property. The value none “Turns off all text-spacing features. All fullwidth characters are set with full-width glyphs.” Unfortunately, no browser seems to support this yet.

answered Sep 28 '22 07:09 by Jukka K. Korpela


There is a classic workaround for this problem. To keep current browsers from showing the line break as a visible space, set font-size to 0 on the container. The space is still there, but it is rendered with zero width.

On the child elements, set font-size back to the size you actually want. For your code, an example would be:

<p class="nowhitespace">
  <span>这是</span>
  <span>一句话。</span>
</p>

The CSS could contain code like the following:

.nowhitespace { font-size: 0; }
.nowhitespace > span { font-size: 16px; }
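
Since the question already preprocesses the text with Ruby, the span markup could be generated in the same step rather than written by hand. This is only a sketch: the method name nowhitespace_paragraph is hypothetical, and it assumes each source line should become one span.

```ruby
# encoding: UTF-8
require 'erb'

# Hypothetical helper: wraps each non-empty source line in its own <span>
# inside a paragraph that carries the "nowhitespace" class.
def nowhitespace_paragraph(text)
  spans = text.lines.map(&:strip).reject(&:empty?).map do |line|
    # Escape HTML special characters so the line is safe to embed.
    "  <span>#{ERB::Util.html_escape(line)}</span>"
  end
  (['<p class="nowhitespace">'] + spans + ['</p>']).join("\n")
end

puts nowhitespace_paragraph("这是\n一句话。")
```

Note that the font-size trick hides every space between the spans, so it would also glue English lines together; it only suits text where no space is wanted at the line break.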

answered Sep 28 '22 05:09 by Florian Rappl