
File encoded as UCS-2 Little Endian reports 2x too many lines to Java

I was processing several txt files with a simple Java program, and the first step of my process is counting the lines of each file:

int count = 0;
// myFile is the txt file in question; FileReader decodes it
// with the platform's default charset
try (BufferedReader br = new BufferedReader(new FileReader(myFile))) {
    while (br.readLine() != null) {
        count++;
    }
}

For one of my files, Java was counting exactly twice as many lines as there really were! This confused me greatly at first. I opened each file in Notepad++ and could see that the mis-counting file ended every line in exactly the same way as the other files, with a CR and LF. I did a little more poking around and noticed that all my "ok" files were ANSI encoded, and the one problem file was encoded as UCS-2 Little Endian (which I know nothing about). I got these files elsewhere, so I have no idea why that one was encoded differently, but of course converting it to ANSI fixed the issue.
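(For anyone hitting the same symptom: rather than re-encoding the file, you can tell Java the charset up front. A minimal sketch, assuming the file carries the BOM that Notepad++ writes for UCS-2 Little Endian files; the class and method names are illustrative.)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetAwareCount {
    // Count lines by naming the charset explicitly instead of re-encoding.
    // The "UTF-16" charset reads the leading BOM to pick the byte order,
    // which covers UCS-2 LE files saved by Notepad++.
    static int countLines(Path file) throws IOException {
        int count = 0;
        try (BufferedReader br = Files.newBufferedReader(file, Charset.forName("UTF-16"))) {
            while (br.readLine() != null) {
                count++;
            }
        }
        return count;
    }
}
```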

But now curiosity remains. Why was the encoding causing a double line count report?

Thanks!

The111 asked Dec 01 '22

1 Answer

Simple: if you apply the wrong encoding when reading UCS-2 (or UTF-16) text, e.g. ANSI or any other 8-bit encoding, then every second character decodes as a NUL (0x00). Each two-byte CR-LF pair thus becomes the four characters CR, 0x00, LF, 0x00, and because the CR and the LF are no longer adjacent, `readLine()` treats the lone CR as one line break and the LF as another. Every real line is therefore counted twice.
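The effect is easy to reproduce in memory. A minimal sketch, decoding the same UTF-16LE bytes once with the correct charset and once with an 8-bit charset (ISO-8859-1 standing in for "ANSI"); the class and method names are illustrative:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DoubleCountDemo {
    // Two lines ("a" and "b"), each ending in CR-LF, encoded as UTF-16LE.
    // (UCS-2 LE and UTF-16LE are byte-identical for these characters.)
    static final byte[] BYTES = "a\r\nb\r\n".getBytes(StandardCharsets.UTF_16LE);

    static int countLines(byte[] data, Charset cs) throws IOException {
        int count = 0;
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), cs))) {
            while (br.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Correct charset: 2 lines. Wrong 8-bit charset: each CR-LF decodes
        // as CR, 0x00, LF, 0x00, and readLine() sees CR and LF as two
        // separate line breaks.
        System.out.println(countLines(BYTES, StandardCharsets.UTF_16LE));   // 2
        System.out.println(countLines(BYTES, StandardCharsets.ISO_8859_1)); // 4
    }
}
```

The doubling is exact because every line ends in CR-LF: each terminator contributes exactly one extra break under the wrong charset.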

Lucero answered Dec 04 '22