Say there is a file too big to fit in memory. How can I get a random line from it? Thanks.
Update: I want the probability of getting each line to be equal.
Java supports several file-reading features. One such utility is reading a specific line in a file. We can do this by simply providing the desired line number; the stream will read the text at that location. The Files class can be used to read the nth line of a file.
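For example, a minimal sketch using java.nio.file.Files (the class name NthLine, the path, and the line number are placeholders); note that this by itself reads a fixed line rather than a random one, and skip() still streams past all of the preceding lines:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class NthLine {
    // Returns the nth line (0-based) of the file at the given path, or null if
    // the file has fewer than n + 1 lines. Files.lines streams lazily, so the
    // whole file never has to fit in memory.
    public static String readNthLine(String path, long n) throws IOException {
        try (Stream<String> lines = Files.lines(Paths.get(path))) {
            return lines.skip(n).findFirst().orElse(null);
        }
    }
}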
Reading the entire file if you want only one line seems a bit excessive. The following should be more efficient:
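(One possible sketch of the idea, using RandomAccessFile; the class name RandomLine and the choose() helper are illustrative, and a non-empty file is assumed. Seek to a random byte offset, back up to the start of the line containing it, then accept that line with probability MIN_LINE_LENGTH / lineLength and retry otherwise.)

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class RandomLine {
    // Lower bound on line length in bytes, terminator included.
    private static final int MIN_LINE_LENGTH = 1;

    public static String choose(File f) throws IOException {
        Random rand = new Random();
        try (RandomAccessFile file = new RandomAccessFile(f, "r")) {
            long fileLength = file.length();
            while (true) {
                // Jump to a uniformly random byte offset in the file.
                long pos = (long) (rand.nextDouble() * fileLength);

                // Back up to the start of the line containing that offset.
                long start = pos;
                while (start > 0) {
                    file.seek(start - 1);
                    if (file.read() == '\n') {
                        break;
                    }
                    start--;
                }
                file.seek(start);
                String line = file.readLine();
                long lineLength = file.getFilePointer() - start; // bytes incl. terminator

                // A line of k bytes is hit by k of the possible offsets, so long
                // lines are over-represented; accepting with probability
                // MIN_LINE_LENGTH / k makes the accepted draws uniform.
                if (rand.nextInt((int) lineLength) < MIN_LINE_LENGTH) {
                    return line;
                }
            }
        }
    }
}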
This is a variant of rejection sampling.
Line lengths include the line terminator character(s), hence MIN_LINE_LENGTH >= 1. (All the better if you know a tighter lower bound on line length.)
It is worth noting that the runtime of this algorithm does not depend on file size, only on line length, i.e. it scales much better than reading the entire file.
Here's a solution. Take a look at the choose() method, which does the real work (the main() method repeatedly exercises choose() to show that the distribution is indeed quite uniform).
The idea is simple: when you read the first line it has a 100% chance of being chosen as the result. When you read the 2nd line it has a 50% chance of replacing the first line as the result. When you read the 3rd line it has a 33% chance of becoming the result, the fourth line a 25% chance, and so on. In general, the nth line replaces the current result with probability 1/n, which leaves every one of the n lines seen so far equally likely to be the final result.
import java.io.*;
import java.util.*;

public class B {
    public static void main(String[] args) throws FileNotFoundException {
        // Call choose() many times and tally the results to show that the
        // distribution over lines is roughly uniform.
        Map<String, Integer> map = new HashMap<String, Integer>();
        for (int i = 0; i < 1000; ++i) {
            String s = choose(new File("g:/temp/a.txt"));
            if (!map.containsKey(s))
                map.put(s, 0);
            map.put(s, map.get(s) + 1);
        }
        System.out.println(map);
    }

    public static String choose(File f) throws FileNotFoundException {
        String result = null;
        Random rand = new Random();
        int n = 0;
        for (Scanner sc = new Scanner(f); sc.hasNextLine(); ) {
            ++n;
            String line = sc.nextLine();
            // The nth line replaces the current result with probability 1/n,
            // which keeps every line seen so far equally likely.
            if (rand.nextInt(n) == 0)
                result = line;
        }
        return result;
    }
}
Either you
read the file twice - once to count the number of lines, then a second time to extract a randomly chosen line (see the sketch just below), or
use reservoir sampling
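A minimal sketch of the first, two-pass option (the class name TwoPassRandomLine and the path parameter are illustrative; the platform default charset is assumed):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

public class TwoPassRandomLine {
    public static String choose(String path) throws IOException {
        // First pass: count the lines without holding them in memory.
        long count = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        // Second pass: skip to a uniformly chosen line index and return that line.
        long target = (long) (new Random().nextDouble() * count);
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line = reader.readLine();
            for (long i = 0; i < target; i++) {
                line = reader.readLine();
            }
            return line;
        }
    }
}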
Looking over Itay's answer, it looks as though it reads the file a thousand times over while sampling just one line each time, whereas true reservoir sampling should only go over the 'tape' once. I've devised some code that goes over the file once with real reservoir sampling, based on his answer and the various descriptions on the web.
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

public class reservoirSampling {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        Sampler mySampler = new Sampler();
        List<String> myList = mySampler.sampler(10);
        for (int index = 0; index < myList.size(); index++) {
            System.out.println(myList.get(index));
        }
    }
}
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Scanner;

public class Sampler {

    public Sampler() {}

    public List<String> sampler(int reservoirSize) throws FileNotFoundException, IOException {
        String currentLine = null;
        // reservoirList is where our selected lines are stored
        List<String> reservoirList = new ArrayList<String>(reservoirSize);
        // counter for the current line number while iterating
        int count = 0;
        Random ra = new Random();
        int randomNumber = 0;

        Scanner sc = new Scanner(new File("Open_source.html")).useDelimiter("\n");
        while (sc.hasNext()) {
            currentLine = sc.next();
            count++;
            if (count <= reservoirSize) {
                // Fill the reservoir with the first reservoirSize lines.
                reservoirList.add(currentLine);
            } else if ((randomNumber = ra.nextInt(count)) < reservoirSize) {
                // Line number 'count' replaces a random reservoir slot with
                // probability reservoirSize / count.
                reservoirList.set(randomNumber, currentLine);
            }
        }
        sc.close();
        return reservoirList;
    }
}
The basic premise is that you fill up the reservoir with the first reservoirSize lines, and then each later line (say the ith line read) replaces a randomly chosen reservoir slot with probability reservoirSize/i. I hope this provides more efficient code. Please let me know if this doesn't work for you, as I've literally knocked it up in half an hour.