I have a file which has characters like: " Joh 1:1 ஆதியிலே வார்த்தை இருந்தது, அந்த வார்த்தை தேவனிடத்திலிருந்தது, அந்த வார்த்தை தேவனாயிருந்தது. "
www.unicode.org/charts/PDF/U0B80.pdf
When I use the following code:
bufferedWriter = new BufferedWriter (new OutputStreamWriter(System.out, "UTF8"));
The output is boxes and other weird characters like this:
"�P�^����O֛���;�<�aYՠ؛"
Can anyone help?
Here is the complete code:
File f = new File("E:\\bible.docx");
Reader decoded = new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8);
bufferedWriter = new BufferedWriter(new OutputStreamWriter(System.out, StandardCharsets.UTF_8));
char[] buffer = new char[1024];
int n;
StringBuilder build = new StringBuilder();
while (true) {
    n = decoded.read(buffer);
    if (n < 0) { break; }
    build.append(buffer, 0, n);
    bufferedWriter.write(buffer, 0, n);   // write only the n chars actually read
}
bufferedWriter.flush();
The StringBuilder value shows the UTF-8 characters, but when it is displayed in the window it shows up as boxes.
Found the answer to the problem! The encoding is correct (i.e. UTF-8): Java reads the file as UTF-8 and the String characters are correct. The problem is that there is no font to display them in NetBeans' output panel. After changing the font for the output panel (NetBeans -> Tools -> Options -> Miscellaneous -> Output tab) I got the expected result. The same applies when the text is displayed in a JTextArea (the font needs to be changed); see the sketch below. The font of the Windows cmd prompt, however, can't be changed in the same way.
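For the JTextArea case, a minimal sketch of the font change could look like the following; the font name "Latha" is only an assumption (any installed font that covers the Tamil block works):

import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;
import java.awt.Font;

public class TamilTextDemo {
    public static void main(String[] args) {
        JTextArea area = new JTextArea();
        // "Latha" is just an example; pick any installed font with Tamil glyphs (U+0B80–U+0BFF)
        area.setFont(new Font("Latha", Font.PLAIN, 18));
        area.setText("ஆதியிலே வார்த்தை இருந்தது, அந்த வார்த்தை தேவனிடத்திலிருந்தது.");

        JFrame frame = new JFrame("Tamil demo");
        frame.add(new JScrollPane(area));
        frame.setSize(500, 200);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);
    }
}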
UTF-8 supports every Unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phoenician, Cherokee, etc.), as well as many non-spoken notations (music notation, mathematical symbols, APL).
There is no difference between "utf8" and "utf-8"; they are simply two names for UTF-8, the most common Unicode encoding.
UTF (Unicode Transformation Format) is the family of standard encodings for representing a great variety of characters from any language. To overcome the limitation of ASCII, which covers only 128 characters, the UTF encodings were developed to represent the characters of every language.
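As a rough illustration for Tamil (my own sketch, not from the question): each character in the Tamil block becomes three bytes in UTF-8, which the snippet below prints out.

import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        String letter = "ஆ";  // U+0B86, TAMIL LETTER AA
        byte[] bytes = letter.getBytes(StandardCharsets.UTF_8);
        // Prints: E0 AE 86 — three bytes for one code point
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}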
Because your output is encoded in UTF-8 but still contains the replacement character (U+FFFD, �), I believe the problem occurs when you read the data.
Make sure that you know what encoding your input stream uses, and set the encoding for the InputStreamReader accordingly. If the text is Tamil, I would guess it's probably UTF-8. I don't know if Java supports TACE-16. It would look something like this…
StringBuilder text = new StringBuilder();
try (InputStream encoded = ...) {
    Reader decoded = new InputStreamReader(encoded, StandardCharsets.UTF_8);
    char[] buffer = new char[1024];
    while (true) {
        int n = decoded.read(buffer);
        if (n < 0)
            break;
        text.append(buffer, 0, n);
    }
}
String verse = text.toString();
System.out is too close to the operating system to be versatile enough. In your case, the NetBeans console is probably using the operating-system encoding and an IDE-picked font. Re-wrapping standard out, as sketched below, at least pins down the encoding.
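A sketch of that re-wrapping (assuming the console can actually render Tamil; the font and code page still have to cooperate):

import java.io.FileDescriptor;
import java.io.FileOutputStream;
import java.io.PrintStream;

public class Utf8Console {
    public static void main(String[] args) throws Exception {
        // Replace the default, platform-encoded stream with an explicit UTF-8 one
        PrintStream utf8Out = new PrintStream(new FileOutputStream(FileDescriptor.out), true, "UTF-8");
        utf8Out.println("ஆதியிலே வார்த்தை இருந்தது");
    }
}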
Write to a file first. If you make it HTML, you can even double-click it and specify the right encoding internally. Mind using "UTF-8" then, as "UTF8" is Java-specific ("UTF-8" can be used in Java too). Maybe with Desktop.getDesktop().open(new File("... .html"));.
A small JFrame with a JTextPane would do too.
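A sketch of the file-first route described above (file name and verse text are placeholders): write the text to an HTML file that declares its own charset, then hand it to the desktop to open in the default browser.

import java.awt.Desktop;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class HtmlPreview {
    public static void main(String[] args) throws Exception {
        String verse = "ஆதியிலே வார்த்தை இருந்தது";   // the text read earlier
        File out = new File("verse.html");              // placeholder output path
        try (Writer w = new OutputStreamWriter(new FileOutputStream(out), StandardCharsets.UTF_8)) {
            // The meta charset tells the browser how to decode the bytes we just wrote
            w.write("<!DOCTYPE html><html><head><meta charset=\"UTF-8\"></head><body>");
            w.write(verse);
            w.write("</body></html>");
        }
        if (Desktop.isDesktopSupported()) {
            Desktop.getDesktop().open(out);
        }
    }
}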