
How to read a file in Java with specific character encoding?

I am trying to read a file in as either UTF-8 or Windows-1252 depending on the output of this method:

public Charset getCorrectCharsetToApply() {
    // Returns a Charset for either UTF-8 or Windows-1252.
}

So far, I have:

String fileName = getFileNameToReadFromUserInput();
InputStream is = new ByteArrayInputStream(fileName.getBytes());
InputStreamReader isr = new InputStreamReader(is, getCorrectCharsetToApply());
BufferedReader buffReader = new BufferedReader(isr);

The problem I'm having is converting the BufferedReader instance to a FileReader.

Furthermore:

  • The name of the file itself (fileName) cannot be trusted to be in a particular charset; sometimes the file name will contain UTF-8 characters, and sometimes Windows-1252. The same goes for the file's content (however, the file name and the file content will always have matching charsets).
  • Only the logic inside getCorrectCharsetToApply() can select the charset to apply, so attempting to read a file by its name prior to calling this method could very well result in Java trying to read the file name with the wrong encoding... which causes it to die!

Thanks in advance!

IAmYourFaja asked Aug 23 '12 17:08


3 Answers

So, first, as a heads up, do realize that fileName.getBytes() as you have there gets the bytes of the filename, not the file itself.
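
To make that concrete, here is a minimal sketch that mirrors the question's stream setup. The class name FileNameBytesDemo and the file name "notes.txt" are made up for illustration, and the charset is passed to getBytes explicitly (the question's code uses the platform default) so the result does not depend on the machine. No file is ever opened; the reader just hands back the characters of the name.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FileNameBytesDemo {
    // Same shape as the question's code: the stream wraps the bytes of the
    // name itself, so nothing on disk is touched.
    static String readFirstLine(String fileName, Charset charset) throws IOException {
        InputStream is = new ByteArrayInputStream(fileName.getBytes(charset));
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // "notes.txt" is a hypothetical name; the file does not need to exist.
        System.out.println(readFirstLine("notes.txt", StandardCharsets.UTF_8)); // prints "notes.txt"
    }
}
```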

Second, from the docs for FileReader:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.

So, sounds like FileReader actually isn't the way to go. If we take the advice in the docs, then you should just change your code to have:

String fileName = getFileNameToReadFromUserInput();
FileInputStream is = new FileInputStream(fileName);
InputStreamReader isr = new InputStreamReader(is, getCorrectCharsetToApply());
BufferedReader buffReader = new BufferedReader(isr);

and not try to make a FileReader at all.
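
For completeness, a self-contained sketch of that approach might look like the following. The class name CharsetStreamReader and the helper readAll are made up for illustration; the charset argument stands in for whatever getCorrectCharsetToApply() returns. It uses try-with-resources (Java 7+) so the stream is closed even if reading fails.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class CharsetStreamReader {
    // Reads the whole file, decoding its bytes with the given charset.
    static String readAll(String fileName, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        // Closing the BufferedReader also closes the wrapped FileInputStream.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), charset))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Usage (hypothetical): java CharsetStreamReader <file> <charset-name>
        System.out.println(readAll(args[0], Charset.forName(args[1])));
    }
}
```

Note that an InputStreamReader built this way does not fail on bytes that are invalid for the chosen charset; it substitutes the replacement character U+FFFD, so a wrong charset guess tends to show up as garbled text rather than an exception.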

Dennis Meng answered Oct 18 '22 23:10


With Java 7+, you can create the Reader in one line:

BufferedReader buffReader = Files.newBufferedReader(Paths.get(fileName), getCorrectCharsetToApply());
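
A slightly fuller sketch of the same call, with the reader closed automatically, could look like this. The class name NioCharsetReader and the helper readLines are hypothetical; the charset argument again stands in for the result of getCorrectCharsetToApply().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class NioCharsetReader {
    // Reads every line of the file, decoding with the given charset.
    static List<String> readLines(String fileName, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(fileName), charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Usage (hypothetical): java NioCharsetReader <file> <charset-name>
        for (String line : readLines(args[0], Charset.forName(args[1]))) {
            System.out.println(line);
        }
    }
}
```

One difference worth knowing: a reader obtained from Files.newBufferedReader reports malformed or unmappable bytes as an IOException instead of silently replacing them, so a wrong charset guess may fail fast here.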

dlauzon answered Oct 18 '22 23:10


Note that if you are using Google Guava, you can use Files.newReader (that is, com.google.common.io.Files, not java.nio.file.Files):

final BufferedReader reader =
        Files.newReader(new File(fileName), getCorrectCharsetToApply());

shadowmatter answered Oct 18 '22 22:10