A legacy software I'm rewriting in Java uses custom (similar to Win-1252) encoding as it's data storage. For the new system I'm building I'd like to replace this with UTF-8.
So I need to convert those files to UTF-8 to feed my database. I know the character map used, but it's not any of the widely known ones. Eg. "A" is on position 0x0041 (as in Win-1252), but on 0x0042 there is a sign which in UTF-8 appears on position 0x0102, and so on. Is there an easy way to decode and convert those files with Java?
I've read many posts already but they all dealt with industry standard encodings of some kind, not with custom ones. I'm expecting it's possible to create a custom java.nio.ByteBuffer.CharsetDecoder or java.nio.charset.Charset to pass it to java.io.InputStreamReader as described in the first Answer here?
Any suggestions welcome.
In Java, the OutputStreamWriter accepts a charset to encode the character streams into byte streams. We can pass a StandardCharsets. UTF_8 into the OutputStreamWriter constructor to write data to a UTF-8 file.
In Eclipse, go to Preferences>General>Workspace and select UTF-8 as the Text File Encoding. This should set the encoding for all the resources in your workspace.
no need to be complicated. just make an array of 256 chars
static char[] map = { ... 'A', '\u0102', ... }
then
read each byte b in source
    int index = (0xff) & b; // to make it unsigned
    char c = map[index];
    target.write( c );
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With