Newline control characters in multi-byte character sets

Tags: newline, character-encoding, unicode, cjk

I have some Perl code that translates carriage returns and line feeds to a normalized form. The input text is Japanese, so it will contain multi-byte characters.

Is it still possible to do this transformation on a byte-by-byte basis (which I think it currently does), or do I have to detect the character set and enable Unicode support? In other words, do the popular encodings (Shift-JIS, EUC-JP, UTF-8, ISO-2022-JP) use bytes inside their multi-byte characters that could be mistaken for ASCII control characters?

I need only CR and LF to work.

Update: Added ISO-2022-JP. And that is the one that looks the most troublesome with its funky escape sequences ...


asked Apr 07 '09 05:04

Thilo

2 Answers

None of the four encodings that you mention (Shift-JIS, UTF-8, EUC-JP, ISO-2022-JP) uses the CR or LF byte inside a Japanese character. For UTF-8 and EUC-JP, there is no overlap whatsoever between ASCII bytes and the bytes inside Japanese characters. For Shift-JIS and ISO-2022-JP there is overlap, but not in the range where CR (0x0D) and LF (0x0A) fall.
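The no-overlap claim for UTF-8 and EUC-JP can be spot-checked with a small Python sketch (the question's code is Perl; Python is used here only for illustration). The sample string and the helper name `multibyte_bytes` are assumptions made up for this example, and a spot check on one string is not a proof for the whole encoding.

```python
# Arbitrary sample text: kanji, hiragana, katakana, punctuation, half-width katakana.
SAMPLE = "日本語のテキスト、ｶﾀｶﾅ"


def multibyte_bytes(text, encoding):
    """Collect every byte used to encode the non-ASCII characters of text."""
    seen = set()
    for ch in text:
        if ord(ch) > 0x7F:
            seen.update(ch.encode(encoding))
    return seen


for enc in ("utf-8", "euc_jp"):
    used = multibyte_bytes(SAMPLE, enc)
    # Every byte should be 0x80 or above, so none can be CR (0x0D) or LF (0x0A).
    print(enc, "lowest byte used:", hex(min(used)))
```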

For ISO-2022-JP,
First-byte range: 0x21 - 0x7E
Second-byte range: 0x21 - 0x7E

And the escape sequence characters to switch back and forth between various character sets are:

0x1B, 0x28, 0x24, 0x40, 0x42, and 0x4A

As you can see, none of the bytes used to encode Japanese characters in ISO-2022-JP, including the escape sequences, overlap with CR (0x0D) or LF (0x0A).
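To see this on real bytes, here is a minimal Python sketch that encodes a short string as ISO-2022-JP, strips out the escape sequences, and checks that the two-byte payload stays within 0x21 - 0x7E. The sample string, the `ESCAPES` table and the helper `double_byte_payload` are illustrative assumptions; the walker only understands the four escape sequences built from the bytes listed above.

```python
# Escape sequences of ISO-2022-JP and the mode each one switches to.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "jis-roman",
    b"\x1b$@": "jis-x-0208",
    b"\x1b$B": "jis-x-0208",
}


def double_byte_payload(data):
    """Return only the bytes that belong to two-byte characters."""
    mode = "ascii"
    payload = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0x1B:
            # A three-byte escape sequence switches the active character set.
            mode = ESCAPES[bytes(data[i:i + 3])]
            i += 3
        else:
            if mode == "jis-x-0208":
                payload.append(data[i])
            i += 1
    return bytes(payload)


text = "abc日本語\ndef"
encoded = text.encode("iso2022_jp")
payload = double_byte_payload(encoded)

print(encoded)
print("payload within 0x21-0x7E:", all(0x21 <= b <= 0x7E for b in payload))
print("LF bytes in output:", encoded.count(b"\n"))
```

The encoder switches back to ASCII before it writes the line feed, so the only 0x0A byte in the output is the real newline.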

For Shift-JIS,
First-byte range: 0x81 - 0x9F, 0xE0 - 0xEF
Second-byte range: 0x40 - 0x7E, 0x80 - 0xFC
Half-width katakana (single byte): 0xA1 - 0xDF

Again, there is no overlap with CR and LF.
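The Shift-JIS ranges above can be checked mechanically. This Python sketch transcribes the ranges as listed and tests whether CR or LF falls inside any of them; the constant and function names are made up for the example.

```python
CR, LF = 0x0D, 0x0A

# Byte ranges as listed above (Python ranges exclude their end value).
LEAD_BYTES = [range(0x81, 0xA0), range(0xE0, 0xF0)]
TRAIL_BYTES = [range(0x40, 0x7F), range(0x80, 0xFD)]
HALFWIDTH_KATAKANA = [range(0xA1, 0xE0)]


def contains(ranges, byte):
    """True if byte falls inside any of the given ranges."""
    return any(byte in r for r in ranges)


def collides_with_newline(ranges):
    """True if CR or LF could appear as a byte from these ranges."""
    return contains(ranges, CR) or contains(ranges, LF)


for name, ranges in [
    ("lead bytes", LEAD_BYTES),
    ("trail bytes", TRAIL_BYTES),
    ("half-width katakana", HALFWIDTH_KATAKANA),
]:
    print(name, "collide with CR/LF:", collides_with_newline(ranges))

# Trail bytes do overlap printable ASCII: the second byte of "ソ" is 0x5C.
print("ソ".encode("shift_jis"))
```

The last line shows why a byte-by-byte approach is unsafe for other ASCII characters such as the backslash, even though it is safe for CR and LF.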

answered Sep 23 '22 01:09

保田ジェフリー

All of those encodings are compatible with ASCII for the first 128 code points; that is, they use a single byte to encode ASCII characters, including CR (0x0D) and LF (0x0A). You shouldn't have any problem.
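A byte-level normalization along these lines might look like the following Python sketch. The function name `normalize_newlines`, the choice of LF as the normalized form, and the sample text are assumptions for illustration; the same substitution on raw bytes should carry over to Perl.

```python
import re

# Matches CRLF first, then a lone CR or a lone LF, on raw bytes.
_NEWLINE = re.compile(rb"\r\n|\r|\n")


def normalize_newlines(data: bytes) -> bytes:
    """Replace every CRLF, CR or LF with a single LF without decoding."""
    return _NEWLINE.sub(b"\n", data)


text = "一行目\r\n二行目\rソース\n終わり"
expected = "一行目\n二行目\nソース\n終わり"

for enc in ("shift_jis", "euc_jp", "utf-8", "iso2022_jp"):
    raw = text.encode(enc)
    # Normalize the bytes, then decode only to compare with the expected text.
    result = normalize_newlines(raw).decode(enc)
    print(enc, "round-trips correctly:", result == expected)
```

This only covers the four encodings discussed here, all of which encode CR and LF as the single bytes 0x0D and 0x0A.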

answered Sep 20 '22 01:09

Alan Moore
