I was processing several txt files with a simple Java program, and the first step of my process was counting the lines of each file:
int count = 0;
// myFile is the txt file in question; FileReader decodes it with the platform default charset
try (BufferedReader br = new BufferedReader(new FileReader(myFile))) {
    while (br.readLine() != null) {
        count++;
    }
}
For one of my files, Java was counting exactly twice as many lines as there really were! This confused me greatly at first. I opened each file in Notepad++ and could see that the miscounted file ended every line in exactly the same way as the other files, with a CR and LF. I did a little more poking around and noticed that all my "ok" files were ANSI encoded, and the one problem file was encoded as UCS-2 Little Endian (which I know nothing about). I got these files elsewhere, so I have no idea why that one was encoded that way, but of course switching it to ANSI fixed the issue.
But my curiosity remains: why was the encoding causing a doubled line count?
Thanks!
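Simple: FileReader decodes the file with the platform default charset, so the wrong encoding is applied here. When UCS-2 (or UTF-16) text is read as ANSI, or any other 8-bit encoding, each 16-bit code unit is split into two characters, and for plain ASCII text every second character comes out as NUL (0x00). That turns each CR-LF into CR-NUL-LF-NUL. Because the CR and LF are no longer adjacent, readLine() treats them as two separate line breaks (one for the CR and one for the LF), so each real line break is counted twice.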
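To see the effect in isolation, here is a minimal sketch that encodes a three-line string as UTF-16LE in memory and counts its lines twice: once decoded as ISO-8859-1 (standing in for "ANSI") and once decoded as UTF-16LE. The class name LineCountDemo, the countLines helper, and the sample text are made up for illustration; they are not from the question.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LineCountDemo {

    // Hypothetical helper: counts lines with the same readLine() loop as the
    // question, but decodes the bytes with an explicitly chosen charset.
    public static int countLines(byte[] data, Charset charset) {
        int count = 0;
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            while (br.readLine() != null) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // Three CR-LF terminated lines, stored as UTF-16 little endian (no BOM).
        byte[] data = "one\r\ntwo\r\nthree\r\n".getBytes(StandardCharsets.UTF_16LE);

        // Wrong charset: every CR and LF is followed by a NUL, so each one
        // ends a "line" on its own, and the final NUL forms one more line.
        System.out.println(countLines(data, StandardCharsets.ISO_8859_1)); // prints 7

        // Right charset: CR-LF is seen as a single line terminator.
        System.out.println(countLines(data, StandardCharsets.UTF_16LE));   // prints 3
    }
}
```

For a real file, the same idea would be to pass the charset explicitly instead of relying on FileReader's default, for example with Files.newBufferedReader(path, StandardCharsets.UTF_16LE), assuming the file really is UTF-16 little endian.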