 

Reading any text file having strange encoding?

I have a text file with a strange encoding, "UCS-2 Little Endian", whose contents I want to read using Java.

[Screenshot: the text file opened in Notepad++]

As you can see in the above screenshot, the file contents appear fine in Notepad++, but when I read the file using this code, only garbage is printed to the console:

String textFilePath = "c:\\strange_file_encoding.txt";
BufferedReader reader = new BufferedReader( new InputStreamReader( new FileInputStream( textFilePath ), "UTF8" ) );
String line = "";

while ( ( line = reader.readLine() ) != null ) {
    System.out.println( line );  // Prints garbage characters 
}

The main point is that the user selects the file to read, so it can be in any encoding. Since I can't detect the file's encoding, I decode it as "UTF8", but as the above example shows, that fails to read this file correctly.

Is there a way to read such strange files correctly? Or can I at least detect that my code will fail to read a file correctly?

Brad asked Mar 19 '13 22:03





1 Answer

You are passing UTF-8 as the encoding to the InputStreamReader constructor, so it tries to interpret the bytes as UTF-8 instead of UCS-2 Little Endian. Here is the documentation: Charset

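According to that documentation, I suppose you need to use UTF-16LE instead. UCS-2 is the fixed-width predecessor of UTF-16, so a UTF-16LE decoder reads UCS-2 Little Endian text correctly.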
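As a rough illustration of that suggestion, here is a minimal sketch that decodes the file as UTF-16LE. The class name Utf16LeReader and the helper readLines are made up for this example and are not from the question. The sketch assumes Java 7 or later (for StandardCharsets and Files), and it assumes the file may start with a byte order mark, which Java's UTF-16LE decoder passes through as the character U+FEFF rather than stripping it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this example only.
public class Utf16LeReader {

    // Reads all lines of the file, decoding its bytes as UTF-16LE.
    static List<String> readLines(Path path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(path, StandardCharsets.UTF_16LE)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // The UTF-16LE decoder keeps a leading byte order mark as
                // U+FEFF, so drop it from the first line if it is there.
                if (lines.isEmpty() && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the question; pass another path as the first argument.
        Path path = Paths.get(args.length > 0 ? args[0]
                                              : "c:\\strange_file_encoding.txt");
        for (String line : readLines(path)) {
            System.out.println(line);
        }
    }
}
```

Whether the console then shows the characters correctly also depends on the console's own encoding, which this sketch does not address.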

Here is more info on the supported character sets and their Java names: Supported Encodings
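On the question of detecting the encoding when the user picks an arbitrary file: one common heuristic, which is not a complete solution, is to look for a byte order mark at the start of the file. The sketch below assumes the file was saved with one (as far as I know, Notepad++ writes one for "UCS-2 Little Endian"). The class BomSniffer and its method detectFromBom are hypothetical names for this example. A null result means "no BOM found", not "the file is UTF-8".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper class, named for this example only.
public class BomSniffer {

    // Returns the charset indicated by a leading byte order mark,
    // or null when the bytes do not start with a recognised BOM.
    // Caveat: FF FE 00 00 is the UTF-32LE BOM and is reported as UTF-16LE here.
    static Charset detectFromBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF
                && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE
                && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    // Reads up to the first three bytes of the file and inspects them.
    static Charset detectFromBom(Path path) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        try (InputStream in = Files.newInputStream(path)) {
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break; // end of file reached before three bytes were read
                }
                n += r;
            }
        }
        return detectFromBom(Arrays.copyOf(head, n));
    }
}
```

A caller could fall back to a default charset, or ask the user, when the result is null. Files without a BOM need a different heuristic or a detection library, which is outside what this sketch covers.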

tempoc answered Nov 10 '22 13:11
