Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to save Chinese Characters to file with java?

I use the following code to save Chinese characters into a .txt file, but when I opened it with Wordpad, I couldn't read it.

StringBuffer Shanghai_StrBuf = new StringBuffer("\u4E0A\u6D77");
boolean Append = true;

FileOutputStream fos;
fos = new FileOutputStream(FileName, Append);
for (int i = 0;i < Shanghai_StrBuf.length(); i++) {
    fos.write(Shanghai_StrBuf.charAt(i));
}
fos.close();

What can I do ? I know if I cut and paste Chinese characters into Wordpad, I can save it into a .txt file. How do I do that in Java ?

like image 247
Frank Avatar asked Apr 19 '09 23:04

Frank


2 Answers

There are several factors at work here:

  • Text files have no intrinsic metadata for describing their encoding (for all the talk of angle-bracket taxes, there are reasons XML is popular)
  • The default encoding for Windows is still an 8bit (or doublebyte) "ANSI" character set with a limited range of values - text files written in this format are not portable
  • To tell a Unicode file from an ANSI file, Windows apps rely on the presence of a byte order mark at the start of the file (not strictly true - Raymond Chen explains). In theory, the BOM is there to tell you the endianess (byte order) of the data. For UTF-8, even though there is only one byte order, Windows apps rely on the marker bytes to automatically figure out that it is Unicode (though you'll note that Notepad has an encoding option on its open/save dialogs).
  • It is wrong to say that Java is broken because it does not write a UTF-8 BOM automatically. On Unix systems, it would be an error to write a BOM to a script file, for example, and many Unix systems use UTF-8 as their default encoding. There are times when you don't want it on Windows, either, like when you're appending data to an existing file: fos = new FileOutputStream(FileName,Append);

Here is a method of reliably appending UTF-8 data to a file:

  private static void writeUtf8ToFile(File file, boolean append, String data)
      throws IOException {
    boolean skipBOM = append && file.isFile() && (file.length() > 0);
    Closer res = new Closer();
    try {
      OutputStream out = res.using(new FileOutputStream(file, append));
      Writer writer = res.using(new OutputStreamWriter(out, Charset
          .forName("UTF-8")));
      if (!skipBOM) {
        writer.write('\uFEFF');
      }
      writer.write(data);
    } finally {
      res.close();
    }
  }

Usage:

  public static void main(String[] args) throws IOException {
    String chinese = "\u4E0A\u6D77";
    boolean append = true;
    writeUtf8ToFile(new File("chinese.txt"), append, chinese);
  }

Note: if the file already existed and you chose to append and existing data wasn't UTF-8 encoded, the only thing that code will create is a mess.

Here is the Closer type used in this code:

public class Closer implements Closeable {
  private Closeable closeable;

  public <T extends Closeable> T using(T t) {
    closeable = t;
    return t;
  }

  @Override public void close() throws IOException {
    if (closeable != null) {
      closeable.close();
    }
  }
}

This code makes a Windows-style best guess about how to read the file based on byte order marks:

  private static final Charset[] UTF_ENCODINGS = { Charset.forName("UTF-8"),
      Charset.forName("UTF-16LE"), Charset.forName("UTF-16BE") };

  private static Charset getEncoding(InputStream in) throws IOException {
    charsetLoop: for (Charset encodings : UTF_ENCODINGS) {
      byte[] bom = "\uFEFF".getBytes(encodings);
      in.mark(bom.length);
      for (byte b : bom) {
        if ((0xFF & b) != in.read()) {
          in.reset();
          continue charsetLoop;
        }
      }
      return encodings;
    }
    return Charset.defaultCharset();
  }

  private static String readText(File file) throws IOException {
    Closer res = new Closer();
    try {
      InputStream in = res.using(new FileInputStream(file));
      InputStream bin = res.using(new BufferedInputStream(in));
      Reader reader = res.using(new InputStreamReader(bin, getEncoding(bin)));
      StringBuilder out = new StringBuilder();
      for (int ch = reader.read(); ch != -1; ch = reader.read())
        out.append((char) ch);
      return out.toString();
    } finally {
      res.close();
    }
  }

Usage:

  public static void main(String[] args) throws IOException {
    System.out.println(readText(new File("chinese.txt")));
  }

(System.out uses the default encoding, so whether it prints anything sensible depends on your platform and configuration.)

like image 136
McDowell Avatar answered Nov 10 '22 01:11

McDowell


If you can rely that the default character encoding is UTF-8 (or some other Unicode encoding), you may use the following:

    Writer w = new FileWriter("test.txt");
    w.append("上海");
    w.close();

The safest way is to always explicitly specify the encoding:

    Writer w = new OutputStreamWriter(new FileOutputStream("test.txt"), "UTF-8");
    w.append("上海");
    w.close();

P.S. You may use any Unicode characters in Java source code, even as method and variable names, if the -encoding parameter for javac is configured right. That makes the source code more readable than the escaped \uXXXX form.

like image 34
Esko Luontola Avatar answered Nov 10 '22 01:11

Esko Luontola