
Java Scanner(File) misbehaving, but Scanner(FileInputStream) always works with the same file

I am seeing weird behavior with Scanner. With a particular set of files, it works when I use the Scanner(FileInputStream) constructor but not when I use the Scanner(File) constructor.

Case 1: Scanner(File)

Scanner s = new Scanner(new File("file"));
while(s.hasNextLine()) {
    System.out.println(s.nextLine());
}

Result: no output

Case 2: Scanner(FileInputStream)

Scanner s = new Scanner(new FileInputStream(new File("file")));
while(s.hasNextLine()) {
    System.out.println(s.nextLine());
}

Result: the file content outputs to the console.

The input file is a java file containing a single class.

I double checked programmatically (in Java) that:

  • the file exists,
  • is readable,
  • and has a non-zero filesize.

Scanner(File) normally works for me in this situation, and I am not sure why it doesn't now.

kashiko asked Feb 29 '12 01:02


2 Answers

hasNextLine() calls findWithinHorizon(), which in turn calls findPatternInBuffer() to search for a match against the line-terminator pattern, defined as .*(\r\n|[\n\r\u2028\u2029\u0085])|.+$

The strange thing is that with both ways of constructing a Scanner (from a FileInputStream or from a File), findPatternInBuffer returns a positive match if the file contains, regardless of its size, for instance the 0x0A line terminator. But if the file also contains a byte outside the ASCII range (i.e. >= 0x80), using FileInputStream returns true while using File returns false.

Very simple test case:

Create a file that contains just the character "a" followed by a newline:

# hexedit file     
00000000   61 0A                                                a.

# java Test.java
using File: true
using FileInputStream: true

Now edit the file with hexedit so that it ends with a 0x80 byte:

# hexedit file
00000000   61 0A 80                                             a..

# java Test.java
using File: false
using FileInputStream: true

The Java test code contains nothing beyond what is already in the question:

import java.io.*;
import java.util.*;

public class Test {
    public static void main(String[] args) {
        try {
            // Scanner built directly from the File
            File file1 = new File("file");
            Scanner s1 = new Scanner(file1);
            System.out.println("using File: " + s1.hasNextLine());

            // Scanner built from a FileInputStream over the same file
            File file2 = new File("file");
            Scanner s2 = new Scanner(new FileInputStream(file2));
            System.out.println("using FileInputStream: " + s2.hasNextLine());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

So it turns out this is a charset issue. In fact, changing the test to:

 Scanner s1 = new Scanner(file1, "latin1");

we get:

# java Test 
using File: true
using FileInputStream: true
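
One way to confirm the charset explanation is to ask the Scanner what went wrong. Scanner treats an IOException from its underlying source as end of input and keeps the most recent one available through ioException(). The sketch below is only an illustration: the class name ScannerCharsetCheck and the helper describe are hypothetical, the temp file recreates the 61 0A 80 bytes from the hexedit session, and "UTF-8" stands in for whatever the platform default charset happens to be. What the UTF-8 line prints may depend on the JDK in use, so no output is claimed for it here.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Scanner;

// Hypothetical helper class for illustration; not part of the original test.
final class ScannerCharsetCheck {

    // Opens the file with an explicit charset and reports both the
    // hasNextLine() result and any IOException the Scanner swallowed.
    static String describe(File file, String charsetName) throws IOException {
        try (Scanner s = new Scanner(file, charsetName)) {
            boolean hasLine = s.hasNextLine();
            IOException swallowed = s.ioException();
            return "hasNextLine=" + hasLine + ", ioException="
                    + (swallowed == null ? "none" : swallowed.getClass().getSimpleName());
        }
    }

    public static void main(String[] args) throws IOException {
        // Recreate the three-byte file from the hexedit session: 61 0A 80.
        File file = File.createTempFile("scanner-demo", ".txt");
        file.deleteOnExit();
        Files.write(file.toPath(), new byte[] {0x61, 0x0A, (byte) 0x80});

        // Assumption: UTF-8 plays the role of the platform default charset.
        System.out.println("UTF-8:      " + describe(file, "UTF-8"));
        // ISO-8859-1 maps every byte to a character, so decoding cannot fail.
        System.out.println("ISO-8859-1: " + describe(file, "ISO-8859-1"));
    }
}
```

If ioException() is non-null after hasNextLine() returns false, the file was not empty; the read failed and the Scanner hid the failure. Passing the file's real encoding to the constructor, as in the "latin1" fix above, avoids depending on the platform default.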
guido answered Oct 21 '22 16:10


Looking at the Oracle/Sun JDK 1.6.0_23 implementation of Scanner, the Scanner(File) constructor creates a FileInputStream, which is meant for raw binary data.

This points to a difference in the buffering and parsing technique used by the two constructors, which directly affects the call to hasNextLine() in your code.

Scanner(InputStream) uses an InputStreamReader, while Scanner(File) wraps the FileInputStream's ByteChannel in a reader (and probably reads the whole file in one go, thus advancing the cursor, in your case).
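
The two reader paths can be compared without Scanner at all. The sketch below is a minimal illustration under stated assumptions: the class DecoderPaths and its methods are hypothetical names, the bytes live in memory rather than in a file, and UTF-8 stands in for the platform default charset. It relies on two documented behaviors: a freshly created CharsetDecoder reports malformed input as an error, whereas InputStreamReader substitutes the replacement character U+FFFD. That Scanner(File) ends up on the first path is my reading of the constructors described above, not something the answers state.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
final class DecoderPaths {

    // Drains a Reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[64];
        int n;
        while ((n = reader.read(buf)) != -1) {
            out.append(buf, 0, n);
        }
        return out.toString();
    }

    // Channel-based reader with a fresh decoder: malformed bytes raise an IOException.
    static String readViaChannel(byte[] bytes) throws IOException {
        try (Reader r = Channels.newReader(
                Channels.newChannel(new ByteArrayInputStream(bytes)),
                StandardCharsets.UTF_8.newDecoder(), -1)) {
            return readAll(r);
        }
    }

    // InputStreamReader: malformed bytes are replaced with U+FFFD.
    static String readViaStream(byte[] bytes) throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            return readAll(r);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = {0x61, 0x0A, (byte) 0x80}; // same bytes as the test file
        try {
            System.out.println("channel reader decoded " + readViaChannel(bytes).length() + " chars");
        } catch (IOException e) {
            System.out.println("channel reader failed: " + e);
        }
        System.out.println("stream reader decoded " + readViaStream(bytes).length() + " chars");
    }
}
```

A Scanner sitting on top of the failing reader would treat the exception as end of input, which matches hasNextLine() returning false for a file that is not empty.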

haylem answered Oct 21 '22 15:10