Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Read file and write file which has characters in UTF - 8 (different language)

Tags:

java

java-io

I have a file which has characters like: " Joh 1:1 ஆதியிலே வார்த்தை இருந்தது, அந்த வார்த்தை தேவனிடத்திலிருந்தது, அந்த வார்த்தை தேவனாயிருந்தது. "

www.unicode.org/charts/PDF/U0B80.pdf‎

When I use the following code:

bufferedWriter = new BufferedWriter (new OutputStreamWriter(System.out, "UTF8"));

The output is boxes and other weird characters like this:

"�P�^����O֛���;�<�aYՠ؛"

Can anyone help?

these are the complete codes:

File f=new File("E:\\bible.docx");
        Reader decoded=new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8);
        bufferedWriter = new BufferedWriter (new OutputStreamWriter(System.out, StandardCharsets.UTF_8));
        char[] buffer = new char[1024];
        int n;
        StringBuilder build=new StringBuilder();
        while(true){
            n=decoded.read(buffer);
            if(n<0){break;}
            build.append(buffer,0,n);
            bufferedWriter.write(buffer);
        }

enter image description here

The StringBuilder value shows the UTF characters but when displaying it in the window it shows as boxes..

Found the Answer to the problem!!! The Encoding is Correct (i.e UTF-8) Java reads the file as UTF-8 and the String characters are UTF-8, The problem is that there is no font to display it in netbeans' output panel. After changing the font for the output panel (Netbeans->tools->options->misc->output tab) I got the expected result. The same applies when it is displayed in JTextArea(font needs to be changed). But we can't change font the windows' cmd prompt.

like image 733
Alfa Avatar asked Aug 01 '13 03:08

Alfa


People also ask

Can UTF-8 handle all languages?

UTF-8 supports any unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phonecian, Cherokee etc), as well as many non-spoken languages (Music notation, mathematical symbols, APL).

What is the difference between UTF-8 and UTF-8?

There is no difference between "utf8" and "utf-8"; they are simply two names for UTF8, the most common Unicode encoding.

Does UTF-16 support all languages?

UTF(Unicode Transformation Format) is the standard for representing a great variety of characters from any language. To overcome the problem of ASCII which is only limited to 128 characters, UTF was developed to encode all characters for each and every language.


2 Answers

Because your output is encoded in UTF-8, but still contains the replacement character (U+FFFD, �), I believe the problem occurs when you read the data.

Make sure that you know what encoding your input stream uses, and set the encoding for the InputStreamReader according. If that's Tamil, I would guess it's probably in UTF-8. I don't know if Java supports TACE-16. It would look something like this…

StringBuilder buffer = new StringBuilder();
try (InputStream encoded = ...) {
  Reader decoded = new InputStreamReader(encoded, StandardCharsets.UTF_8);
  char[] buffer = new char[1024];
  while (true) {
    int n = decoded.read(buffer);
    if (n < 0)
      break;
    buffer.append(buffer, 0, n);
  }
}
String verse = buffer.toString();
like image 91
erickson Avatar answered Oct 27 '22 09:10

erickson


System.out is too near to the operating system, to be versatile enough. In your case, the NetBeans console probably is using the operating system encoding, and IDE picked font.

Write to a file first. If you make it HTML, you can even double click it, and specify internally the right encoding. Mind using "UTF-8" then, as "UTF8" is Java specific ("UTF-8" can be used in Java too). Maybe with JDesktop.getDesktop().open("... .html");.

A small JFrame with a JTextPane would do too.

like image 25
Joop Eggen Avatar answered Oct 27 '22 09:10

Joop Eggen