Curly quotes causing Java Scanner hasNextLine() to be false -- why?

I've been having an issue getting java.util.Scanner to read a text file I saved in Notepad, even though it works fine with other files. When it tries to read the problem file, it comes up completely empty-handed: hasNextLine() is false, the buffer is empty, and so on. I narrowed it down to the fact that it won't even read the first line if there is a curly quote anywhere in the file. No exceptions are thrown. Note that a BufferedReader on the same file doesn't have a problem.

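import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

// Class name is illustrative; the original snippet was a bare try block.
public class ReadTest {
    public static void main(String[] args) {
        try {
            // Count lines with a Scanner (no charset given, so the default is used).
            int count = 0;
            Scanner scanner = new Scanner(new File("C:/myfile.txt"));

            while (scanner.hasNextLine()) {
                count++;
                scanner.nextLine();
            }

            scanner.close();
            System.out.print(count);

            // Count lines in the same file with a BufferedReader.
            count = 0;
            BufferedReader reader = new BufferedReader(new FileReader("C:/myfile.txt"));

            while (reader.readLine() != null) {
                count++;
            }

            reader.close();
            System.out.print(count);
        }
        catch (IOException e) {
            e.printStackTrace();
        }
    }
}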

The above code, reading a file that contains nothing but a single curly quote, prints out "01". Searches on Google led me to try this:

Scanner scanner = new Scanner(new File("C:/myfile.txt"), "ISO-8859-1");

This makes it work (i.e., it prints out "11"). I also noticed that if I go into Notepad and do a Save As..., the default encoding at the bottom is "ANSI". If I change this to "UTF-8" and save the file, then the scanner (without an encoding) also works. If I tell the scanner "UTF-8", then understandably it only works if I save as UTF-8, but "ISO-8859-1" seems to make it work even if I save it as "ANSI".

So, I know it has something to do with file encoding, but the problem is I don't understand anything about file encoding. My knowledge of what "ISO-8859-1" means is extremely vague; why does that make it work no matter how I save the file? Why does BufferedReader work regardless?

EDIT:

The links/comments below really helped point me in the right direction! I think I've got it figured out.

First of all, in Notepad:

"ANSI" is CP1252
"Unicode" is UTF-16LE
"UTF-8" is... well, UTF-8

In hexadecimal, a curly apostrophe is represented as:

CP1252: 92
UTF-16LE: 19 20
UTF-8: E2 80 99
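
To check these byte values yourself, a small sketch along these lines should work. The class and method names are made up for illustration, and it assumes the JDK provides the "windows-1252" charset, which is Java's name for CP1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CurlyQuoteBytes {
    // Encode the text in the given charset and format the bytes as spaced hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String apostrophe = "\u2019"; // the curly apostrophe (right single quotation mark)

        // Assumption: "windows-1252" is available in this JDK.
        System.out.println("CP1252:   " + hex(apostrophe, Charset.forName("windows-1252")));
        System.out.println("UTF-16LE: " + hex(apostrophe, StandardCharsets.UTF_16LE));
        System.out.println("UTF-8:    " + hex(apostrophe, StandardCharsets.UTF_8));
    }
}
```

The same helper can be pointed at any other character to see where the encodings agree (plain ASCII) and where they diverge.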

The default encoding Java uses on my system, according to Charset.defaultCharset(), is UTF-8. So when I saved the file in UTF-8, the scanner knew what to expect. When I saved the file in CP1252, however, it choked once it hit that "92", because a lone 92 byte is not valid UTF-8: it is a continuation byte and cannot start a character. It works fine as long as there aren't any such characters in the file. The bytes for "hello world" are the same in CP1252 and UTF-8, since both encode ASCII characters identically, so they don't cause a problem.
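
A rough way to see this without involving a file at all is to run the raw bytes through a strict decoder. This is only a sketch; the class and method names are hypothetical. It relies on the fact that a freshly created CharsetDecoder reports malformed input as an exception instead of silently replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StrictDecodeCheck {
    // Returns true if the bytes decode without error in the given charset.
    // A new CharsetDecoder reports malformed input by default, so bad bytes
    // surface as a CharacterCodingException.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ascii = "hello world".getBytes(StandardCharsets.US_ASCII);
        byte[] cp1252Quote = { (byte) 0x92 };
        byte[] utf8Quote = { (byte) 0xE2, (byte) 0x80, (byte) 0x99 };

        System.out.println("ASCII text as UTF-8:     " + decodesCleanly(ascii, StandardCharsets.UTF_8));
        System.out.println("CP1252 quote as UTF-8:   " + decodesCleanly(cp1252Quote, StandardCharsets.UTF_8));
        System.out.println("UTF-8 quote as UTF-8:    " + decodesCleanly(utf8Quote, StandardCharsets.UTF_8));
    }
}
```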

Reading a UTF-16 file as UTF-8 doesn't work, because the decoder doesn't know what to do with the byte order mark ("FF FE"), regardless of what characters are in the file. Neither FF nor FE can appear anywhere in valid UTF-8.
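
As an illustration, here are the bytes of a UTF-16LE file holding just the curly apostrophe, decoded two ways. The names are hypothetical. Note that new String(bytes, charset) substitutes U+FFFD for anything it cannot decode instead of failing, so the damage shows up as replacement characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    // Decode leniently: malformed input becomes U+FFFD rather than an error.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Render each char of the string as U+XXXX for easy inspection.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // BOM (FF FE) followed by the curly apostrophe (19 20) in little-endian order.
        byte[] fileBytes = { (byte) 0xFF, (byte) 0xFE, (byte) 0x19, (byte) 0x20 };

        // The "UTF-16" charset reads the BOM to pick the byte order and then drops it.
        System.out.println("as UTF-16: " + codePoints(decode(fileBytes, StandardCharsets.UTF_16)));

        // UTF-8 cannot make sense of FF or FE at all.
        System.out.println("as UTF-8:  " + codePoints(decode(fileBytes, StandardCharsets.UTF_8)));
    }
}
```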

On the other hand, when I set the scanner to CP1252 or ISO-8859-1, it's much more tolerant. ISO-8859-1 in particular assigns a character to every possible byte value, so decoding can never fail. It doesn't necessarily interpret the characters correctly, mind you, but there's nothing that prevents it from recognizing lines in the file and looping through.
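
To see the "works, but not necessarily correctly" part concretely, one could decode the 92 byte under both charsets and compare the resulting characters. The sketch below uses made-up names and again assumes "windows-1252" is available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1Tolerance {
    // Decode a single byte in the given charset and report the resulting character as U+XXXX.
    static String codePoint(byte b, Charset charset) {
        String s = new String(new byte[] { b }, charset);
        return String.format("U+%04X", (int) s.charAt(0));
    }

    public static void main(String[] args) {
        byte quote = (byte) 0x92;

        // ISO-8859-1 maps byte 92 straight to U+0092, an invisible control character.
        System.out.println("ISO-8859-1:   " + codePoint(quote, StandardCharsets.ISO_8859_1));

        // Assumption: "windows-1252" is available; it maps 92 to the curly apostrophe.
        System.out.println("windows-1252: " + codePoint(quote, Charset.forName("windows-1252")));
    }
}
```

So ISO-8859-1 gets through the file without complaint, but the character it hands back for the curly apostrophe is not the one Notepad wrote.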

As far as why Scanner has a problem but the FileReader/BufferedReader does not, I am going to guess that it's because the scanner needs to tokenize the file, i.e., interpret the characters so it can identify whitespace and other patterns, so it chokes when there's something unrecognizable. The reader doesn't need to do that. All it needs to identify are the line breaks.
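
One way to probe that guess is Scanner.ioException(), which returns the last IOException the scanner's underlying source raised, or null if there was none. The sketch below is hypothetical (the names and the temp file are mine) and passes charsets explicitly so the result doesn't depend on the platform default. As far as I can tell, the difference comes down to error handling in the decoder more than tokenizing: Scanner's decoder reports the malformed byte, and Scanner records the exception and treats the input as finished, while a reader built from a charset name replaces the bad byte with U+FFFD and carries on.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.Scanner;

public class ScannerVsReader {
    // Creates a temp file holding the single byte 0x92 (the CP1252 curly apostrophe).
    static File writeSample() {
        try {
            File f = File.createTempFile("curly", ".txt");
            f.deleteOnExit();
            Files.write(f.toPath(), new byte[] { (byte) 0x92 });
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts the lines a Scanner sees when told to use the named charset.
    static int scannerLines(File f, String charsetName) {
        try (Scanner sc = new Scanner(f, charsetName)) {
            int n = 0;
            while (sc.hasNextLine()) {
                sc.nextLine();
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Drains the Scanner, then returns whatever IOException it swallowed (or null).
    static IOException scannerError(File f, String charsetName) {
        try (Scanner sc = new Scanner(f, charsetName)) {
            while (sc.hasNextLine()) {
                sc.nextLine();
            }
            return sc.ioException();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts the lines a BufferedReader sees when told to use the named charset.
    static int readerLines(File f, String charsetName) {
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), charsetName))) {
            int n = 0;
            while (r.readLine() != null) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File f = writeSample();
        System.out.println("Scanner, UTF-8:        " + scannerLines(f, "UTF-8") + " line(s)");
        System.out.println("Scanner, ISO-8859-1:   " + scannerLines(f, "ISO-8859-1") + " line(s)");
        System.out.println("BufferedReader, UTF-8: " + readerLines(f, "UTF-8") + " line(s)");
        System.out.println("Scanner.ioException(): " + scannerError(f, "UTF-8"));
    }
}
```

Checking ioException() after a loop that ends suspiciously early is a cheap way to tell "the file really was empty" apart from "the scanner gave up".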

275

asked Sep 19 '13 17:09

MysteriousWhisper

1 Answer

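If you don't specify an encoding when you create the scanner, it uses the platform default charset, which you can check with Charset.defaultCharset(). It does not try to divine the encoding from a byte order mark (BOM), which is the first few bytes of a file. On Western-locale Windows that default has traditionally been cp-1252, but it can differ from machine to machine. Notepad's "ANSI" option saves the file as cp-1252, which is similar to, but not the same as, ISO-8859-1: the two differ in the byte range 80-9F, which is where cp-1252 puts the curly quotes. See this link for more details:

http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

When you save it as UTF-8, Notepad typically places the UTF-8 BOM at the beginning of the file. The scanner does not use the BOM to pick an encoding; the file reads correctly when the default charset is UTF-8 because its bytes are valid UTF-8, and the BOM itself comes through as the character U+FEFF at the start of the first line.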

If you want to look more into BOMs, look them up on Wikipedia--the article is quite good. You can also download PSPad and open the text file in hex mode to see the individual bytes. Hope that helps :)
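
If you'd rather stay in Java than reach for a hex editor, something like the sketch below can report which BOM, if any, a file starts with. The class and method names are hypothetical, and it only knows the UTF-8 and UTF-16 marks (it would misreport a UTF-32LE file, whose BOM also begins with FF FE). The sample byte arrays stand in for the first few bytes of a file.

```java
public class BomSniffer {
    // True if 'head' starts with the given byte values.
    static boolean startsWith(byte[] head, int... expected) {
        if (head.length < expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if ((head[i] & 0xFF) != expected[i]) {
                return false;
            }
        }
        return true;
    }

    // Names the BOM at the start of the data, checking UTF-8 and UTF-16 only.
    static String describeBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8 BOM";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE BOM";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE BOM";
        }
        return "no BOM";
    }

    public static void main(String[] args) {
        // Stand-ins for the leading bytes of three differently saved files.
        byte[] utf8WithBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 0xE2, (byte) 0x80, (byte) 0x99 };
        byte[] utf16le = { (byte) 0xFF, (byte) 0xFE, (byte) 0x19, (byte) 0x20 };
        byte[] ansi = { (byte) 0x92 };

        System.out.println(describeBom(utf8WithBom));
        System.out.println(describeBom(utf16le));
        System.out.println(describeBom(ansi));
    }
}
```

For a real file, the leading bytes could come from java.nio.file.Files.readAllBytes on the file's path.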

188

answered Nov 14 '22 09:11

Craig Schmidt

Tags: java, encoding, utf-8
