
Converting from Windows 1252 to UTF8 in Java: null characters with CharsetDecoder/Encoder

Tags: java, encoding

I know this is a very general question, but it's driving me mad.

I used this code:

String ucs2Content = new String(bufferToConvert, inputEncoding);
byte[] outputBuf = ucs2Content.getBytes(outputEncoding);
return outputBuf;

But I read that it is better to use CharsetDecoder and CharsetEncoder (my content probably contains some characters outside the destination encoding). I've just written this code, but it has some problems:

// Create the encoder and decoder for Win1252
Charset charsetInput = Charset.forName(inputEncoding);
CharsetDecoder decoder = charsetInput.newDecoder();

Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();

// Convert the byte array from starting inputEncoding into UCS2
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));

// Convert the internal UCS2 representation into outputEncoding
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
return bbuf.array();

However, this code appends a sequence of null characters to the output buffer!

Could someone tell me where the problem is? I'm not very experienced with encoding conversion in Java.

Is there a better way to convert encoding in Java?

robob asked May 25 '11 16:05


People also ask

How to convert a String into UTF-8 encoding in Java?

To encode a String as UTF-8, call its getBytes() method with the target charset, for example getBytes("UTF-8") or getBytes(StandardCharsets.UTF_8). The method encodes the String into a sequence of bytes in that charset and returns them as a byte array.

Is Windows 1252 a subset of UTF-8?

Windows-1252 is a subset of UTF-8 in terms of which characters are available, but not in terms of their byte-by-byte representation. Windows-1252 uses the single bytes 128 to 255 for characters that UTF-8 encodes as multi-byte sequences. Characters in the ASCII range (bytes 0 to 127) are encoded identically in both.
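As a rough illustration of the difference in byte representation, the sketch below encodes the same characters in both charsets and prints the bytes as hex. The class name ByteRepresentationDemo and the helper hex are made up for this example, and it assumes the JVM provides the windows-1252 charset (it is not one of the charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteRepresentationDemo {
    // Hypothetical helper: formats a byte array as space-separated hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumption: this JVM ships the windows-1252 charset.
        Charset win1252 = Charset.forName("windows-1252");

        String euro = "\u20AC";  // the euro sign
        String letter = "A";     // plain ASCII

        // Same character, different bytes in each encoding.
        System.out.println(hex(euro.getBytes(win1252)));                 // 80
        System.out.println(hex(euro.getBytes(StandardCharsets.UTF_8)));  // E2 82 AC

        // ASCII characters are identical in both encodings.
        System.out.println(hex(letter.getBytes(win1252)));                // 41
        System.out.println(hex(letter.getBytes(StandardCharsets.UTF_8))); // 41
    }
}
```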

What is UTF-8 encoding in Java?

UTF-8 is a variable-width character encoding that uses between one and four eight-bit bytes to represent every valid Unicode code point. A code point can represent a single character, but it can also have other meanings, such as formatting.
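The variable width can be seen with a small sketch that counts the UTF-8 bytes for one code point from each length class. The class name Utf8WidthDemo and the method utf8Length are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8WidthDemo {
    // Hypothetical helper: number of bytes UTF-8 needs for a single code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x41));    // 'A'            -> 1
        System.out.println(utf8Length(0xA3));    // pound sign     -> 2
        System.out.println(utf8Length(0x20AC));  // euro sign      -> 3
        System.out.println(utf8Length(0x1F600)); // emoji U+1F600  -> 4
    }
}
```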


2 Answers

Your problem is that ByteBuffer.array() returns a direct reference to the array that backs the ByteBuffer, not a copy of the backing array's valid range. You have to respect bbuf.limit() (as Peter did in his response) and use only the array content from index 0 to bbuf.limit()-1.
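One possible way to apply this to the conversion from the question is sketched below. The class name EncodingConverter and the method convert are invented for this example, and it assumes the JVM provides the windows-1252 charset. Instead of returning bbuf.array(), it copies only the bytes between the buffer's position and limit.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public class EncodingConverter {
    // Hypothetical helper: re-encodes bytes from inputEncoding to outputEncoding.
    static byte[] convert(byte[] input, String inputEncoding, String outputEncoding)
            throws CharacterCodingException {
        CharBuffer chars = Charset.forName(inputEncoding).newDecoder()
                .decode(ByteBuffer.wrap(input));
        ByteBuffer encoded = Charset.forName(outputEncoding).newEncoder()
                .encode(chars);

        // Copy only the valid range (position..limit); the backing array
        // returned by encoded.array() may be longer and zero-padded.
        byte[] result = new byte[encoded.remaining()];
        encoded.get(result);
        return result;
    }

    public static void main(String[] args) throws CharacterCodingException {
        // "Hi " followed by the pound and euro signs, as Windows-1252 bytes.
        byte[] win1252 = {'H', 'i', ' ', (byte) 0xA3, (byte) 0x80};
        byte[] utf8 = convert(win1252, "windows-1252", "UTF-8");
        System.out.println(utf8.length); // prints 8 (3 + 2 + 3 bytes)
    }
}
```

With their default settings, a decoder or encoder created by newDecoder() or newEncoder() reports malformed or unmappable input by throwing a CharacterCodingException, which is why the method declares it.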

The reason for the extra 0 values in the backing array is a slight flaw in how the CharsetEncoder creates the resulting ByteBuffer. Each CharsetEncoder has an "average bytes per character" value, which for the UCS2 encoder seems simple and correct (2 bytes/char). Using this fixed value, the CharsetEncoder initially allocates a ByteBuffer of "string length * average bytes per character" bytes, e.g. 20 bytes for a 10-character string. However, the UCS2 CharsetEncoder starts its output with a BOM (byte order mark), which also occupies 2 bytes, so only 9 of the 10 characters fit in the allocated ByteBuffer. The CharsetEncoder detects the overflow and allocates a new ByteBuffer with a length of 2*n+1 (n being the length of the original ByteBuffer), in this case 2*20+1 = 41 bytes. Since only 2 of the 21 additional bytes are needed to encode the remaining character, the array you get from bbuf.array() will have a length of 41 bytes, but bbuf.limit() will indicate that only the first 22 entries are actually used.
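The mismatch between the two values can be observed with a small sketch like the following. The class name BackingArrayDemo and the method encodeUtf16 are invented for this example, and it uses Java's UTF-16 charset as a stand-in for the UCS2 encoder described above. The 22 valid bytes follow from the encoding itself, while the size of the backing array is an internal detail of the JDK and may differ between versions, so no value is shown for it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class BackingArrayDemo {
    // Hypothetical helper: encodes text as UTF-16 (with BOM) and returns the raw buffer.
    static ByteBuffer encodeUtf16(String text) throws CharacterCodingException {
        return StandardCharsets.UTF_16.newEncoder().encode(CharBuffer.wrap(text));
    }

    public static void main(String[] args) throws CharacterCodingException {
        ByteBuffer bbuf = encodeUtf16("0123456789"); // 10 characters

        // Valid bytes: 2-byte BOM + 10 characters * 2 bytes = 22.
        System.out.println("limit          = " + bbuf.limit());

        // The backing array can be larger than the valid range;
        // its exact length depends on the JDK's allocation strategy.
        System.out.println("array().length = " + bbuf.array().length);
    }
}
```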

jarnbjo answered Oct 18 '22 17:10


I am not sure how you get a sequence of null characters. Try this

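import java.nio.*;
import java.nio.charset.*;

// The decoder was missing from the original snippet; Windows-1252 is taken from the question
Charset charsetInput = Charset.forName("windows-1252");
CharsetDecoder decoder = charsetInput.newDecoder();

String outputEncoding = "UTF-8";
Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();

// Convert the byte array from starting inputEncoding into UCS2
byte[] bufferToConvert = "Hello World! £€".getBytes(charsetInput);
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));

// Convert the internal UCS2 representation into outputEncoding
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
// Read only the first bbuf.limit() bytes of the backing array
System.out.println(new String(bbuf.array(), 0, bbuf.limit(), charsetOutput));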

prints

Hello World! £€

Peter Lawrey answered Oct 18 '22 17:10