Java: ASCII random line file access with state

Tags: java, io

Is there a better [pre-existing optional Java 1.6] solution than creating a streaming file reader class that will meet the following criteria?

  • Given an ASCII file of arbitrarily large size where each line is terminated by a \n
  • For each invocation of some method readLine(), read a random line from the file
  • And for the life of the file handle, no call to readLine() should return the same line twice

Update:

  • All lines must eventually be read

Context: the file's contents are created by Unix shell commands that produce a listing of all paths contained within a given directory; there are anywhere from millions to a billion files (which yields millions to a billion lines in the target file). If there is some way to randomly distribute the paths into the file at creation time, that is an acceptable solution as well.

cfeduke asked Dec 26 '22 11:12


2 Answers

To avoid reading in the whole file, which may not be possible in your case, you may want to use a RandomAccessFile instead of a standard Java FileInputStream. With RandomAccessFile, you can use the seek(long) method to jump to an arbitrary byte offset in the file and start reading there. The code would look something like this:

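import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

RandomAccessFile raf = new RandomAccessFile("path-to-file", "r"); //read-only access is enough
//key on the byte offset where the line starts; keying on hashCode() could
//collide and wrongly reject a distinct line
Map<Long, String> sampledLines = new HashMap<Long, String>();
//numberOfRandomSamples must not exceed the number of lines, or this never ends
while (sampledLines.size() < numberOfRandomSamples)
{
    //seek to a random point in the file
    raf.seek((long)(Math.random()*raf.length()));

    //skip from the random location to the beginning of the next line
    int nextByte = raf.read();
    while (nextByte != -1 && nextByte != '\n')
        nextByte = raf.read();

    //wrap around to the first line if the skip ran to the end of the file
    if (raf.getFilePointer() >= raf.length()) raf.seek(0);
    long lineStart = raf.getFilePointer();

    //ensure uniqueness: skip lines that were already sampled
    if (sampledLines.containsKey(lineStart)) continue;

    //read the line into a buffer
    StringBuilder lineBuffer = new StringBuilder();
    nextByte = raf.read();
    while (nextByte != -1 && nextByte != '\n')
    {
        lineBuffer.append((char)nextByte);
        nextByte = raf.read(); //advance, otherwise the loop never terminates
    }
    sampledLines.put(lineStart, lineBuffer.toString());
}
raf.close();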

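When the loop finishes, sampledLines holds the randomly selected lines. If the seek lands close to the end of the file, the search for the next newline can run off the end, so that case needs handling. Note also that the seek picks a uniformly random byte, not a uniformly random line, so a line that follows a long line is more likely to be chosen than one that follows a short line.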

EDIT: I made it wrap to the beginning of the file in case you reach the end. It was a pretty simple check.

EDIT 2: I made it verify uniqueness of lines by using a HashMap.

S E answered Dec 31 '22 13:12


Pre-process the input file and remember the offset of each new line. Use a BitSet to keep track of used lines. If you want to save some memory, then remember the offset of every 16th line; it is still easy to jump into the file and do a sequential lookup within a block of 16 lines.
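A minimal sketch of that idea, written for Java 1.6. The class name RandomLineReader and its methods are hypothetical, chosen only for illustration. It keeps the offset of every line (8 bytes per line), so for a file with close to a billion lines you would probably want the sparser every-16th-line index instead. Picking the next unused line by probing forward from a random index is a simplification: it guarantees that every line is eventually returned exactly once, but the order is not a perfectly uniform shuffle.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Random;

// Hypothetical helper class; the name and API are illustrative only.
class RandomLineReader {
    private final RandomAccessFile raf;
    private long[] offsets = new long[1024];   // byte offset where each line starts
    private int lineCount = 0;
    private final BitSet used = new BitSet();  // bit i is set once line i has been returned
    private int remaining;
    private final Random random = new Random();

    RandomLineReader(File file) throws IOException {
        // Pre-processing pass: scan sequentially and record where each line starts.
        InputStream in = new BufferedInputStream(new FileInputStream(file));
        try {
            long pos = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (lineCount == offsets.length) {
                        offsets = Arrays.copyOf(offsets, lineCount * 2); // grow the index
                    }
                    offsets[lineCount++] = pos;
                    atLineStart = false;
                }
                if (b == '\n') {
                    atLineStart = true;
                }
                pos++;
            }
        } finally {
            in.close();
        }
        remaining = lineCount;
        raf = new RandomAccessFile(file, "r");
    }

    /** Returns a line that has not been returned yet, or null once all lines are used. */
    String readLine() throws IOException {
        if (remaining == 0) {
            return null;
        }
        // Probe from a random index to the next unused line, wrapping at the end.
        int i = used.nextClearBit(random.nextInt(lineCount));
        if (i >= lineCount) {
            i = used.nextClearBit(0);
        }
        used.set(i);
        remaining--;
        raf.seek(offsets[i]);
        return raf.readLine(); // maps each byte to one char, which is fine for ASCII
    }

    void close() throws IOException {
        raf.close();
    }
}
```

Usage would be to construct it once (this pays for one full sequential scan of the file) and then call readLine() until it returns null.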

Marko Topolnik answered Dec 31 '22 14:12