Converting from Windows 1252 to UTF8 in Java: null characters with CharsetDecoder/Encoder

Tags:

encoding

I know this is a very general question, but it's driving me mad.

I used this code:

String ucs2Content = new String(bufferToConvert, inputEncoding);
byte[] outputBuf = ucs2Content.getBytes(outputEncoding);
return outputBuf;

But I read that it is better to use CharsetDecoder and CharsetEncoder (my content probably contains some characters outside the destination encoding). I've just written this code, but it has some problems:

// Create the encoder and decoder for Win1252
Charset charsetInput = Charset.forName(inputEncoding);
CharsetDecoder decoder = charsetInput.newDecoder();

Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();

// Convert the byte array from starting inputEncoding into UCS2
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));

// Convert the internal UCS2 representation into outputEncoding
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
return bbuf.array();

However, this code appends a sequence of null characters to the buffer!

Could someone tell me where the problem is? I'm not very experienced with encoding conversion in Java.

Is there a better way to convert encoding in Java?

asked May 25 '11 16:05

robob

2 Answers

Your problem is that ByteBuffer.array() returns a direct reference to the array used as the backing store for the ByteBuffer, not a copy of the backing array's valid range. You have to respect bbuf.limit() (as Peter did in his response) and use only the array content from index 0 to bbuf.limit()-1.
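A minimal sketch of that fix, assuming the conversion is wrapped in a helper. The class name `CharsetConvertSketch` and the method `convert` are made up for illustration and do not come from the question. It copies only the bytes between the buffer's position and limit rather than returning `bbuf.array()`:

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.util.Arrays;

class CharsetConvertSketch {

    // Hypothetical helper: re-encodes bytes from inputEncoding to outputEncoding.
    static byte[] convert(byte[] input, String inputEncoding, String outputEncoding) {
        try {
            // Decode the input bytes into Java's internal character representation.
            CharBuffer chars = Charset.forName(inputEncoding).newDecoder()
                    .decode(ByteBuffer.wrap(input));
            // Encode the characters into the target charset.
            ByteBuffer bytes = Charset.forName(outputEncoding).newEncoder().encode(chars);

            // Copy only the valid region [position, limit); the backing array may be longer.
            byte[] result = new byte[bytes.remaining()];
            bytes.get(result);
            return result;
        } catch (CharacterCodingException e) {
            // Malformed or unmappable input; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "Hi £€" as windows-1252 bytes: £ is 0xA3, € is 0x80.
        byte[] win1252 = {'H', 'i', ' ', (byte) 0xA3, (byte) 0x80};
        byte[] utf8 = convert(win1252, "windows-1252", "UTF-8");
        System.out.println(Arrays.toString(utf8));
    }
}
```

The returned array has exactly as many bytes as the encoder produced, so no trailing zero bytes reach the caller.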

The reason for the extra 0 values in the backing array is a slight flaw in how the resulting ByteBuffer is created by the CharsetEncoder. Each CharsetEncoder has an "average bytes per character" value, which for the UCS2 encoder seems to be simple and correct (2 bytes/char). Based on this fixed value, the CharsetEncoder initially allocates a ByteBuffer with "string length * average bytes per character" bytes, in this case e.g. 20 bytes for a 10-character string. The UCS2 CharsetEncoder, however, starts with a BOM (byte order mark), which also occupies 2 bytes, so only 9 of the 10 characters fit in the allocated ByteBuffer. The CharsetEncoder detects the overflow and allocates a new ByteBuffer with a length of 2*n+1 (n being the original length of the ByteBuffer), in this case 2*20+1 = 41 bytes. Since only 2 of the 21 new bytes are required to encode the remaining character, the array you get from bbuf.array() will have a length of 41 bytes, but bbuf.limit() will indicate that only the first 22 entries are actually used.
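To see the gap between the backing array and the valid range, something like the following could be run. It assumes Java's built-in UTF-16 charset stands in for the "UCS2" encoder described above, and the class and method names are hypothetical. The exact length of the backing array depends on the JDK's allocation strategy, so the sketch only prints it:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class BackingArrayDemo {

    // Encodes text with the UTF-16 encoder (which emits a BOM first) and
    // returns the raw ByteBuffer so limit and backing array can be compared.
    static ByteBuffer encodeUtf16(String text) {
        try {
            return StandardCharsets.UTF_16.newEncoder().encode(CharBuffer.wrap(text));
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer bbuf = encodeUtf16("0123456789"); // 10 characters
        // Number of bytes that actually hold encoded data.
        System.out.println("valid bytes (limit): " + bbuf.limit());
        // Size of the array behind the buffer; may exceed the limit.
        System.out.println("backing array length: " + bbuf.array().length);
    }
}
```

For a 10-character input the limit is 22 (a 2-byte BOM plus 20 bytes of text), while the backing array may be longer; any bytes past the limit are unused padding.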

answered Oct 18 '22 17:10

jarnbjo

I am not sure how you get a sequence of null characters. Try this:

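// The decoder was not defined in the original snippet; windows-1252 is assumed from the question
String inputEncoding = "windows-1252";
Charset charsetInput = Charset.forName(inputEncoding);
CharsetDecoder decoder = charsetInput.newDecoder();

String outputEncoding = "UTF-8";
Charset charsetOutput = Charset.forName(outputEncoding);
CharsetEncoder encoder = charsetOutput.newEncoder();

// Convert the byte array from starting inputEncoding into UCS2
byte[] bufferToConvert = "Hello World! £€".getBytes(charsetInput);
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert));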

// Convert the internal UCS2 representation into outputEncoding
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf));
System.out.println(new String(bbuf.array(), 0, bbuf.limit(), charsetOutput));

prints

Hello World! £€

answered Oct 18 '22 17:10

Peter Lawrey
