
Reading File from Windows and Linux yields different results (character encoding?)

Currently I'm trying to read a file in MIME format which contains some binary data of a PNG.

On Windows, reading the file gives me the proper binary string: I can just copy the string over to a new file, change the extension to .png, and see the picture.


An example after reading the file in Windows is below:

    --fh-mms-multipart-next-part-1308191573195-0-53229
     Content-Type: image/png;name=app_icon.png
     Content-ID: "<app_icon>"
     content-location: app_icon.png

    ‰PNG

etc...etc...

An example after reading the file in Linux is below:

    --fh-mms-multipart-next-part-1308191573195-0-53229
     Content-Type: image/png;name=app_icon.png
     Content-ID: "<app_icon>"
     content-location: app_icon.png

     �PNG

etc...etc...


I am not able to convert the Linux version into a picture, as it all becomes funky symbols with a lot of upside-down "?" and "1/2" symbols.

Can anyone enlighten me as to what is going on and maybe provide a solution? I've been playing with the code for more than a week now.

Maurice



2 Answers

ï¿½ is a sequence of three characters with the byte values 0xEF 0xBF 0xBD, which is the UTF-8 representation of the Unicode codepoint U+FFFD. That codepoint is itself the replacement character for illegal UTF-8 sequences.

Apparently, for some reason, the set of routines involved in your source code (on Linux) is handling the PNG header incorrectly. The PNG header starts with the byte 0x89 (followed by 0x50, 0x4E, 0x47), which is handled correctly on Windows (which might be treating the file as a sequence of CP1252 bytes). In CP1252, the 0x89 byte is displayed as ‰ (the per-mille sign), which matches your Windows output.

On Linux, however, this byte is being decoded by a UTF-8 routine (or a library that thought it was good to process the file as a UTF-8 sequence). Since 0x89 on its own is not a valid codepoint in the ASCII-7 range (ref: the UTF-8 encoding scheme), it cannot be mapped to a valid UTF-8 codepoint in the 0x00-0x7F range. It also cannot start a valid multi-byte UTF-8 sequence, since all multi-byte sequences begin with at least two bits set to 1 (11....), and because it is the very first byte of the file it cannot be a continuation byte either. The resulting behavior is that the UTF-8 decoder replaces 0x89 with the replacement character's bytes 0xEF 0xBF 0xBD (how silly, considering that the file is not UTF-8 to begin with), which are displayed in ISO-8859-1 as ï¿½ - hence the upside-down "?" and "1/2" symbols you are seeing.
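
For instance, here is a minimal Java sketch of that behavior (the class name is just for illustration; the byte values are the ones discussed above):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class ReplacementCharDemo {
        public static void main(String[] args) {
            // The first four bytes of a PNG file: 0x89 'P' 'N' 'G'.
            byte[] header = {(byte) 0x89, 0x50, 0x4E, 0x47};

            // Decoded as UTF-8, 0x89 is malformed, so the decoder substitutes
            // the replacement character U+FFFD (encoded as 0xEF 0xBF 0xBD).
            String asUtf8 = new String(header, StandardCharsets.UTF_8);
            System.out.println((int) asUtf8.charAt(0)); // 65533, i.e. 0xFFFD

            // Decoded as CP1252, 0x89 maps to the per-mille sign.
            String asCp1252 = new String(header, Charset.forName("windows-1252"));
            System.out.println(asCp1252); // ‰PNG
        }
    }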

If you need to resolve this problem, you'll need to ensure the following in Linux:

  • Read the bytes in the file using a suitable encoding (i.e. not UTF-8); this is necessary if you are reading the file as a sequence of characters*, and not necessary if you are reading bytes alone. You might be doing this correctly already, so it is worth verifying the subsequent steps as well (a sketch follows below).
  • When you are viewing the contents of the file, use a suitable editor/viewer that does not perform any internal decoding of the file into a sequence of UTF-8 bytes. Using a suitable font will also help, for you might want to avoid the scenario where the glyph (for 0xFFFD it is actually the diamond character �) cannot be represented and results in further changes (unlikely, but you never know how the editor/viewer has been written).
  • It is also a good idea to write the files out (if you are doing so) in a suitable encoding - ISO-8859-1 perhaps, instead of UTF-8. If you are processing and storing the file contents in memory as bytes instead of characters, then writing them to an output stream (without the involvement of any String or character references) is sufficient.

* Apparently, the Java runtime will decode the byte sequence into UTF-16 codepoints if you convert a sequence of bytes to a character or a String object.
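
A minimal sketch of the first point: read the file as bytes so that no charset is involved at all (the file name here is hypothetical):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ReadAsBytes {
        public static void main(String[] args) throws IOException {
            // No Reader, no charset: the 0x89 at the start of the PNG part
            // arrives unchanged in the byte array.
            byte[] raw = Files.readAllBytes(Paths.get("message.mime"));
            System.out.println("Read " + raw.length + " bytes");
        }
    }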

Vineet Reynolds


In Java, String ≠ byte[].

  • byte[] represents raw binary data.
  • String represents text, which has an associated charset/encoding to be able to tell which characters it represents.

Binary Data ≠ Text.

Text data inside a String uses Unicode/UTF-16 as its charset/encoding (or Unicode/modified UTF-8 when serialized). Whenever you convert from something that is not a String to a String, or vice versa, you need to specify a charset/encoding for the non-String text data (even if you do it implicitly, using the platform's default charset).
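
A small illustration of those conversions with an explicit charset (the class name and sample string are chosen only for the example):

    import java.nio.charset.StandardCharsets;

    public class CharsetConversions {
        public static void main(String[] args) {
            String headerLine = "Content-Type: image/png";

            // String -> byte[]: a charset must be chosen, implicitly or explicitly.
            byte[] asBytes = headerLine.getBytes(StandardCharsets.US_ASCII);

            // byte[] -> String: again, the charset decides what the bytes mean.
            String roundTrip = new String(asBytes, StandardCharsets.US_ASCII);

            System.out.println(roundTrip.equals(headerLine)); // true
        }
    }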

A PNG file contains raw binary data that represents an image (and associated metadata), not text. Therefore, you should not treat it as text.

\x89PNG is not text; it's just a "magic" header that identifies PNG files. 0x89 isn't even a character, it's just an arbitrary byte value, and its only sane representations for display are things like \x89, 0x89, ... Likewise, the PNG in there is in reality binary data; it could just as well have been 0xdeadbeef and nothing would have changed. The fact that PNG happens to be human-readable is just a convenience.

Your problem comes from the fact that your protocol mixes text and binary data, while Java (unlike some other languages, like C) treats binary data differently from text.

Java provides *InputStream for reading binary data, and *Reader for reading text. I see two ways to deal with input:

  • Treat everything as binary data. When you read a whole text line, convert it into a String, using the appropriate charset/encoding.
  • Layer an InputStreamReader on top of an InputStream; access the InputStream directly when you want binary data, and the InputStreamReader when you want text.

You may want buffering; in the second case, the correct place to put it is below the *Reader. If you used a BufferedReader, it would probably consume more input from the InputStream than it should. So you would have something like:

 ┌───────────────────┐
 │ InputStreamReader │
 └───────────────────┘
          ↓
┌─────────────────────┐
│ BufferedInputStream │
└─────────────────────┘
          ↓
   ┌─────────────┐
   │ InputStream │
   └─────────────┘

You would use the InputStreamReader to read text, then you would use the BufferedInputStream to read an appropriate amount of binary data from the same stream.
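
A sketch of that layering (the file name and the ISO-8859-1 charset are assumptions for the example; note that the InputStreamReader itself may still buffer a few bytes internally):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    public class LayeredInput {
        public static void main(String[] args) throws IOException {
            // Buffering sits below the reader, as in the diagram above.
            try (InputStream in = new BufferedInputStream(new FileInputStream("message.mime"))) {
                InputStreamReader reader = new InputStreamReader(in, StandardCharsets.ISO_8859_1);

                int ch = reader.read();      // one decoded character (text view)

                byte[] chunk = new byte[4096];
                int n = in.read(chunk);      // raw bytes (binary view), no decoding

                System.out.println("char=" + ch + ", bytes=" + n);
            }
        }
    }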

A problematic case is recognizing both "\r" (old MacOS) and "\r\n" (DOS/Windows) as line terminators. In that case you may end up reading one character too much. You could take the approach that the deprecated DataInputStream.readLine() method took: transparently wrap the internal InputStream in a PushbackInputStream and unread that extra character.
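
A sketch of that unread trick (a hypothetical helper, not the deprecated method itself):

    import java.io.IOException;
    import java.io.PushbackInputStream;

    public class LineTerminators {
        // Consumes the rest of a line terminator after a '\r' has been read.
        // If the next byte is not '\n', it is pushed back so the caller sees
        // it again on the next read.
        static void skipAfterCarriageReturn(PushbackInputStream in) throws IOException {
            int next = in.read();
            if (next != '\n' && next != -1) {
                in.unread(next);
            }
        }
    }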

However, since you don't appear to have a Content-Length, I would recommend the first way: treat everything as binary and convert to String only after reading a whole line. In this case, I would treat the MIME delimiter as binary data as well.
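
A minimal sketch of that first approach, assuming ISO-8859-1 for the header lines (the class and method names are just for illustration):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;

    public class BinaryLines {
        // Reads bytes up to a '\n' (tolerating a preceding '\r') and converts
        // them to a String only once the whole line has been read.
        static String readLine(InputStream in) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                buf.write(b);
            }
            if (b == -1 && buf.size() == 0) {
                return null; // end of stream
            }
            byte[] bytes = buf.toByteArray();
            int len = bytes.length;
            if (len > 0 && bytes[len - 1] == '\r') {
                len--; // drop the CR of a CRLF terminator
            }
            return new String(bytes, 0, len, StandardCharsets.ISO_8859_1);
        }
    }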

Output:

Since you are dealing with binary data, you cannot just println() it. PrintStream has write() methods that can deal with binary data (e.g. for outputting to a binary file).

Or maybe your data has to be transported over a channel that treats it as text. Base64 is designed for exactly that situation (transporting binary data as ASCII text). The Base64-encoded form uses only US-ASCII characters, so you should be able to use it with any charset/encoding that is a superset of US-ASCII (ISO-8859-*, UTF-8, CP-1252, ...). Since you are converting binary data to/from text, the only sane API for Base64 would be something like:

String Base64Encode(byte[] data);
byte[] Base64Decode(String encodedData);

which is basically what the internal java.util.prefs.Base64 uses.
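
For reference, the public java.util.Base64 API added in Java 8 has exactly this shape:

    import java.util.Base64;

    public class Base64RoundTrip {
        public static void main(String[] args) {
            byte[] binary = {(byte) 0x89, 0x50, 0x4E, 0x47}; // start of a PNG signature

            // byte[] -> String: safe to send over any ASCII-compatible text channel.
            String encoded = Base64.getEncoder().encodeToString(binary);
            System.out.println(encoded); // iVBORw==

            // String -> byte[]: restores the original bytes exactly.
            byte[] decoded = Base64.getDecoder().decode(encoded);
            System.out.println(decoded.length == binary.length); // true
        }
    }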

Conclusion:

In Java, String ≠ byte[].

Binary Data ≠ Text.

ninjalj