 

Newline control characters in multi-byte character sets

I have some Perl code that translates carriage returns and line feeds to a normalized form. The input text is Japanese, so it will contain multi-byte characters.

Is it still possible to do this transformation on a byte-by-byte basis (which I think the code currently does), or do I have to detect the character set and enable Unicode support? In other words, do the popular encodings (Shift-JIS, EUC-JP, UTF-8, ISO-2022-JP) use bytes inside their multi-byte characters that could be mistaken for ASCII control characters?

I need only CR and LF to work.

Update: Added ISO-2022-JP. And that is the one that looks the most troublesome with its funky escape sequences ...

Thilo asked Apr 07 '09 05:04




2 Answers

None of the four encodings that you mention (Shift-JIS, UTF-8, EUC-JP, ISO-2022-JP) uses the CR (0x0D) or LF (0x0A) byte inside Japanese characters. For UTF-8 and EUC-JP, there is no overlap whatsoever between the low ASCII range and the bytes inside Japanese characters. For Shift-JIS and ISO-2022-JP there is some overlap with ASCII, but not in the range where you find CR and LF.

For ISO-2022-JP,
First-byte range: 0x21 - 0x7E
Second-byte range: 0x21 - 0x7E

And the escape sequence characters to switch back and forth between various character sets are:

0x1B, 0x28, 0x24, 0x40, 0x42, and 0x4A

As you can see, none of the characters used to encode Japanese characters in ISO-2022-JP overlap with CR or LF.
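To make the ISO-2022-JP case concrete, here is a small sketch in Python (not the asker's Perl code). It assumes Python's built-in `iso2022_jp` codec is a fair stand-in for whatever produced the real input, and the sample string is made up for illustration. It checks that every encoded byte is either ESC (0x1B) or falls in 0x21 - 0x7E:

```python
# Illustration only: Python's built-in "iso2022_jp" codec is assumed to be
# representative of the encoder that produced the real input.
ESC = 0x1B
CR, LF = 0x0D, 0x0A

# Hypothetical sample text with no line breaks in it.
encoded = "日本語".encode("iso2022_jp")

# Every byte should be ESC or lie in the printable range 0x21 - 0x7E.
all_in_range = all(b == ESC or 0x21 <= b <= 0x7E for b in encoded)

# Neither control byte should appear, since the text had no line breaks.
has_cr_or_lf = CR in encoded or LF in encoded

print(encoded)
print("all bytes in expected range:", all_in_range)
print("contains CR or LF:", has_cr_or_lf)
```

The encoded string starts with the escape sequence that switches to the Japanese character set and ends with the one that switches back to ASCII, so the only byte below 0x21 is ESC itself.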

For Shift-JIS,
First-byte range: 0x81 - 0x9F, 0xE0 - 0xEF
Second-byte range: 0x40 - 0x7E, 0x80 - 0xFC
Half-width katakana: 0xA1 - 0xDF

Again, there is no overlap with CR and LF.
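If you would rather check this than take the byte tables on trust, something along these lines should work. It is only a sketch: it relies on Python's built-in codec tables (which may differ slightly from vendor variants such as CP932), and it only scans the Basic Multilingual Plane. The helper name `offenders` is made up for this example.

```python
# Sketch: scan every non-ASCII BMP code point and look for CR or LF bytes
# in its encoded form. Uses Python's own codec tables as an assumption.
ENCODINGS = ("shift_jis", "euc_jp", "utf-8", "iso2022_jp")
CR, LF = 0x0D, 0x0A


def offenders(encoding):
    """Return non-ASCII code points whose encoded bytes contain CR or LF."""
    found = []
    for cp in range(0x80, 0x10000):
        try:
            encoded = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            # Not representable in this encoding (or a lone surrogate).
            continue
        if CR in encoded or LF in encoded:
            found.append(cp)
    return found


results = {enc: offenders(enc) for enc in ENCODINGS}
for enc, found in results.items():
    print(enc, "code points containing CR/LF bytes:", len(found))
```

An empty list for each encoding means that, at least in these codec tables, a 0x0A or 0x0D byte in the input can only come from a real LF or CR.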

保田ジェフリー answered Sep 23 '22 01:09


All of those character sets are identical to ASCII for the first 128 code points--that is, they only use one byte to encode ASCII characters, including CR (0x0D) and LF (0x0A). You shouldn't have any problem.
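As a rough illustration of what byte-level normalization looks like, here is a sketch in Python (the question's code is Perl, so treat this as a stand-in). It assumes the normalized form is a bare LF, which the question does not actually specify, and the function name and sample text are invented for the example. The bytes are rewritten without decoding, and the result is then decoded only to confirm that the Japanese text survived:

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR LF pairs and lone CRs to LF, working purely on bytes.

    Assumption: LF is the desired normalized form.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical sample with all three line-ending styles.
text = "一行目\r\n二行目\r三行目\n"
expected = "一行目\n二行目\n三行目\n"

round_trips = {}
for enc in ("shift_jis", "euc_jp", "utf-8", "iso2022_jp"):
    normalized = normalize_newlines(text.encode(enc))
    # Decode only to verify the result; the normalization itself never
    # needed to know which encoding the bytes were in.
    round_trips[enc] = normalized.decode(enc) == expected

print(round_trips)
```

The order of the two replacements matters: CR LF pairs have to be collapsed first, otherwise each pair would turn into two line feeds.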

Alan Moore answered Sep 20 '22 01:09