Is there a better [pre-existing optional Java 1.6] solution than creating a streaming file reader class that will meet the following criteria?
readLine() reads a random line from the file
readLine() should never return the same line twice
Update:
Context: the file's contents are created from Unix shell commands that produce a listing of all paths contained within a given directory; there are anywhere from millions to a billion files (which yields millions to a billion lines in the target file). If there is some way to randomly distribute the paths into the file at creation time, that is an acceptable solution as well.
In order to avoid reading in the whole file, which may not be possible in your case, you may want to use a RandomAccessFile instead of a standard Java FileInputStream. With RandomAccessFile, you can use the seek(long position) method to skip to an arbitrary place in the file and start reading there. The code would look something like this:
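import java.io.RandomAccessFile;
import java.util.HashMap;

//open read-only; "rw" would needlessly require write access to the file
RandomAccessFile raf = new RandomAccessFile("path-to-file","r");
long length = raf.length();
//key each sampled line by its starting offset; keying by line.hashCode()
//would discard distinct lines whose hash codes happen to collide
HashMap<Long,String> sampledLines = new HashMap<Long,String>();
//note: this never terminates if numberOfRandomSamples exceeds the number of lines in the file
while(sampledLines.size() < numberOfRandomSamples)
{
    //seek to a random point in the file
    raf.seek((long)(Math.random()*length));
    //skip from the random location to the beginning of the next line
    int nextByte = raf.read();
    while(nextByte != -1 && nextByte != '\n')
        nextByte = raf.read();
    //wrap around to the beginning of the file if you reach the end
    if(raf.getFilePointer() >= length) raf.seek(0);
    //ensure uniqueness: skip lines whose starting offset was already sampled
    long lineStart = raf.getFilePointer();
    if(sampledLines.containsKey(lineStart)) continue;
    //read the line into a buffer
    StringBuilder lineBuffer = new StringBuilder();
    nextByte = raf.read();
    while(nextByte != -1 && nextByte != '\n')
    {
        //the cast assumes a single-byte encoding such as ASCII
        lineBuffer.append((char)nextByte);
        nextByte = raf.read();
    }
    sampledLines.put(lineStart,lineBuffer.toString());
}
raf.close();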
Here, sampledLines should hold your randomly selected lines at the end. You may need to check that you haven't randomly skipped to the end of the file as well to avoid an error in that case.
EDIT: I made it wrap to the beginning of the file in case you reach the end. It was a pretty simple check.
EDIT 2: I made it verify uniqueness of lines by using a HashMap.
Pre-process the input file and remember the offset of each new line. Use a BitSet to keep track of used lines. If you want to save some memory, then remember the offset of every 16th line; it is still easy to jump into the file and do a sequential lookup within a block of 16 lines.
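A minimal sketch of how this could look, sticking to Java 6 syntax. The class name IndexedLineSampler, the STRIDE constant, and the choice of UTF-8 with '\n' line endings are assumptions made for illustration, not part of the answer above. One pass over the file records the offset of every 16th line; readLine() then picks an unused line via the BitSet, seeks to the start of its block, and skips forward sequentially.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Random;

// Hypothetical helper: sparse line-offset index plus a BitSet of used lines.
class IndexedLineSampler {
    private static final int STRIDE = 16;      // remember every 16th line's offset

    private final RandomAccessFile raf;
    private final Random random = new Random();
    private final BitSet used = new BitSet();  // bit i is set once line i was returned
    private long[] blockOffsets = new long[1024];
    private int lineCount = 0;
    private int remaining;

    IndexedLineSampler(File file) throws IOException {
        // Pre-processing pass: count lines and record the offset of each block start.
        InputStream in = new BufferedInputStream(new FileInputStream(file));
        try {
            long pos = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (lineCount % STRIDE == 0) {
                        int block = lineCount / STRIDE;
                        if (block == blockOffsets.length) {
                            blockOffsets = Arrays.copyOf(blockOffsets, block * 2);
                        }
                        blockOffsets[block] = pos;
                    }
                    lineCount++;
                    atLineStart = false;
                }
                if (b == '\n') atLineStart = true;
                pos++;
            }
        } finally {
            in.close();
        }
        remaining = lineCount;
        raf = new RandomAccessFile(file, "r");
    }

    // Returns a line that has not been returned before, or null when all are used.
    String readLine() throws IOException {
        if (remaining == 0) return null;
        // Start at a random line and take the next unused one, wrapping if needed.
        int line = used.nextClearBit(random.nextInt(lineCount));
        if (line >= lineCount) line = used.nextClearBit(0);
        used.set(line);
        remaining--;
        // Jump to the block start, then skip sequentially within the block.
        raf.seek(blockOffsets[line / STRIDE]);
        for (int i = line % STRIDE; i > 0; i--) nextLineBytes();
        return nextLineBytes().toString("UTF-8");
    }

    // Reads bytes up to (not including) the next '\n' or end of file.
    private ByteArrayOutputStream nextLineBytes() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = raf.read()) != -1 && b != '\n') buf.write(b);
        return buf;
    }

    void close() throws IOException {
        raf.close();
    }
}
```

Taking the next clear bit after a random index guarantees termination but is not perfectly uniform once many lines are used; retrying with a fresh random index until an unused line is hit would be an alternative. For a billion lines the BitSet needs roughly 125 MB and the sparse offset array roughly 500 MB, so the stride may need to be larger depending on available heap.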