
Weird behavior with java Scanner reading files

Tags:

java

So, I just ran into an interesting problem while using the Scanner class to read the contents of files. I'm trying to read several output files, produced by a parsing application, from a directory in order to compute some accuracy metrics.

My code walks through each file in the directory and opens it with a Scanner to process the contents. For some reason, a few of the files (all UTF-8 encoded) were not being read by the Scanner. Even though the files were not empty, scanner.hasNextLine() returned false on its very first call (I confirmed this in the debugger). I was initializing the Scanner directly with the File objects each time (the File objects were created successfully), i.e.:

    File file = new File(pathName);
    ...
    Scanner scanner = new Scanner(file);

I tried a couple of things, and was eventually able to fix this problem by initializing the scanner in the following way:

    Scanner scanner = new Scanner(new FileInputStream(file));

Though I'm happy to have solved the problem, I'm still curious as to what might have been happening to cause the problem before. Any ideas? Thanks so much!

shaunvxc asked Dec 14 '12 15:12


1 Answer

According to the Scanner.java source in Java 6u23, a new line is detected by the following patterns:

    private static final String LINE_SEPARATOR_PATTERN =
                                           "\r\n|[\n\r\u2028\u2029\u0085]";
    private static final String LINE_PATTERN = ".*("+LINE_SEPARATOR_PATTERN+")|.+$";

So you could check whether the following regex matches the content of the files that were not read:

    .*(\r\n|[\n\r\u2028\u2029\u0085])|.+$
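A minimal sketch of that check, assuming you have already loaded the file content into a String. The class and method names (`LinePatternCheck`, `containsLine`) are made up for illustration, and this only approximates what Scanner does internally with the pattern:

```java
import java.util.regex.Pattern;

class LinePatternCheck {
    // Same pattern text as LINE_PATTERN in the Scanner.java source quoted in this answer.
    static final String LINE_SEPARATOR_PATTERN = "\r\n|[\n\r\u2028\u2029\u0085]";
    static final Pattern LINE_PATTERN =
            Pattern.compile(".*(" + LINE_SEPARATOR_PATTERN + ")|.+$");

    // Returns true if the pattern finds at least one line in the given content.
    static boolean containsLine(String content) {
        return LINE_PATTERN.matcher(content).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLine("first line\nsecond line")); // prints true
        System.out.println(containsLine(""));                        // prints false
    }
}
```

If this returns true for the content of a file that the Scanner refuses to read, the line pattern is probably not the culprit and the problem lies in how the bytes are decoded.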

I would also check whether an exception was raised while reading.
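Scanner does not throw I/O errors from hasNextLine(); it stores the last one, and you can retrieve it with ioException(). Here is a rough sketch of how to look for such a swallowed error. The class `ScannerErrorCheck` and the helpers `readError` and `tempFile` are hypothetical names, and the demo file is fabricated just to provoke a decoding problem:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.Scanner;

class ScannerErrorCheck {
    // Reads every line, then returns the IOException the Scanner swallowed
    // while reading, or null if it did not hit one.
    static IOException readError(File file, String charsetName) {
        try (Scanner scanner = new Scanner(file, charsetName)) {
            while (scanner.hasNextLine()) {
                scanner.nextLine();
            }
            return scanner.ioException();
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes the given bytes to a temporary file.
    static File tempFile(byte[] bytes) {
        try {
            File file = File.createTempFile("scanner-demo", ".txt");
            file.deleteOnExit();
            Files.write(file.toPath(), bytes);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 0xE9 is e-acute in ISO-8859-1, but on its own it is not valid UTF-8.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9, '\n'};
        System.out.println(readError(tempFile(latin1), "UTF-8"));
    }
}
```

A non-null result here means the Scanner stopped because of a read or decoding error rather than because the file was empty.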

UPDATE: This made me curious, so I looked for answers. It seems your question has already been asked and answered here: Java Scanner(File) misbehaving, but Scanner(FIleInputStream) always works with the same file

To summarize, the cause is non-ASCII characters, which are handled differently depending on whether you initialize your Scanner with a File or a FileInputStream.
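One way to take the guesswork out is to name the charset explicitly instead of relying on the platform default, which is what Scanner(File) uses when no charset is given. The following is only a sketch under that assumption; `ExplicitCharsetScanner`, `readLines` and `tempFile` are illustrative names, not part of any library:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ExplicitCharsetScanner {
    // Reads all lines using an explicitly named charset instead of the platform default.
    static List<String> readLines(File file, String charsetName) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(file, charsetName)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
            // Surface any error the Scanner swallowed instead of returning a short list.
            IOException swallowed = scanner.ioException();
            if (swallowed != null) {
                throw new UncheckedIOException(swallowed);
            }
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Hypothetical helper: writes text to a temporary file in the given charset.
    static File tempFile(String text, String charsetName) {
        try {
            File file = File.createTempFile("scanner-charset", ".txt");
            file.deleteOnExit();
            Files.write(file.toPath(), text.getBytes(Charset.forName(charsetName)));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = tempFile("caf\u00e9\nna\u00efve\n", "UTF-8");
        System.out.println(readLines(file, "UTF-8").size()); // prints 2
    }
}
```

With the charset stated up front, the File-based constructor and the stream-based one decode the same bytes the same way for valid input, so the choice of constructor should no longer matter for well-formed UTF-8 files.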

Will answered Oct 17 '22 10:10