
How to convert custom encoded file to UTF-8 (in Java or with a dedicated tool)

A legacy application I'm rewriting in Java uses a custom encoding (similar to Win-1252) for its data storage. For the new system I'm building, I'd like to replace this with UTF-8.

So I need to convert those files to UTF-8 to feed my database. I know the character map used, but it's not any of the widely known ones. E.g., "A" is at position 0x41 (as in Win-1252), but at 0x42 there is a character whose Unicode code point is U+0102 (Ă), and so on. Is there an easy way to decode and convert those files with Java?

I've read many posts already, but they all dealt with industry-standard encodings of some kind, not with custom ones. I expect it's possible to create a custom java.nio.charset.CharsetDecoder or java.nio.charset.Charset and pass it to a java.io.InputStreamReader, as described in the first answer here?
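For reference, this is the kind of minimal sketch I have in mind — LegacyCharset and LEGACY_MAP are placeholder names, and the table would be filled from my character map:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CoderResult;

    public class LegacyCharset extends Charset {
        // 256-entry table: index = legacy byte value, value = Unicode char
        static final char[] LEGACY_MAP = new char[256];
        static {
            LEGACY_MAP[0x41] = 'A';      // same as Win-1252
            LEGACY_MAP[0x42] = '\u0102'; // differs from Win-1252
            // ... remaining entries from the custom map ...
        }

        public LegacyCharset() {
            super("x-legacy", null); // "x-" prefix for a non-registered name
        }

        @Override
        public boolean contains(Charset cs) {
            return cs.equals(this);
        }

        @Override
        public CharsetDecoder newDecoder() {
            return new CharsetDecoder(this, 1.0f, 1.0f) { // 1 byte -> 1 char
                @Override
                protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                    while (in.hasRemaining()) {
                        if (!out.hasRemaining()) {
                            return CoderResult.OVERFLOW; // output buffer full
                        }
                        out.put(LEGACY_MAP[in.get() & 0xff]);
                    }
                    return CoderResult.UNDERFLOW; // all input consumed
                }
            };
        }

        @Override
        public CharsetEncoder newEncoder() {
            // decode-only; the new system writes UTF-8 instead
            throw new UnsupportedOperationException();
        }
    }

An InputStreamReader built with new LegacyCharset() would then decode the files transparently:

    Reader reader = new InputStreamReader(new FileInputStream("legacy.dat"), new LegacyCharset());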

Any suggestions welcome.

asked Jan 20 '11 by mmm

People also ask

How do I convert to UTF-8 in Java?

In Java, the OutputStreamWriter accepts a charset to encode the character streams into byte streams. We can pass StandardCharsets.UTF_8 to the OutputStreamWriter constructor to write data to a UTF-8 file.
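A minimal sketch (the file name out.txt is just an example):

    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class WriteUtf8 {
        public static void main(String[] args) throws Exception {
            // try-with-resources flushes and closes the writer automatically
            try (Writer out = new OutputStreamWriter(
                    new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
                out.write("A\u0102"); // encoded as UTF-8 bytes on disk
            }
        }
    }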

How do I change the encoding to UTF-8 in Eclipse?

In Eclipse, go to Preferences > General > Workspace and select UTF-8 as the Text File Encoding. This should set the encoding for all the resources in your workspace.


1 Answer

No need for anything complicated. Just make an array of 256 chars, mapping each byte value to its Unicode character:

static char[] map = { ... 'A', '\u0102', ... };

then

    int b;
    while ((b = in.read()) != -1) { // read each byte from the source
        char c = map[b & 0xff];     // 0xff mask keeps the index unsigned (0..255)
        target.write(c);
    }
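Putting it together, a complete conversion pass could look like this (the file names and the Convert class are illustrative; map is the 256-entry table above):

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class Convert {
        static final char[] map = new char[256]; // filled from the custom character map

        public static void main(String[] args) throws Exception {
            try (InputStream in = new BufferedInputStream(new FileInputStream("legacy.dat"));
                 Writer target = new OutputStreamWriter(
                         new FileOutputStream("converted.txt"), StandardCharsets.UTF_8)) {
                int b;
                while ((b = in.read()) != -1) {
                    target.write(map[b & 0xff]);
                }
            }
        }
    }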
answered Sep 28 '22 by irreputable