Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get a random line of a text file in Java?

Tags:

java

file

random

Say there is a file too big to be put to memory. How can I get a random line from it? Thanks.

Update: I want to the probabilities of getting each line to be equal.

like image 229
Fluffy Avatar asked Feb 07 '10 19:02

Fluffy


People also ask

How do you read a specific line from a text file in Java?

Java supports several file-reading features. One such utility is reading a specific line in a file. We can do this by simply providing the desired line number; the stream will read the text at that location. The Files class can be used to read the n t h nth nth line of a file.


4 Answers

Reading the entire file if you want only one line seems a bit excessive. The following should be more efficient:

  1. Use RandomAccessFile to seek to a random byte position in the file.
  2. Seek left and right to the next line terminator. Let L the line between them.
  3. With probability (MIN_LINE_LENGTH / L.length) return L. Otherwise, start over at step 1.

This is a variant of rejection sampling.

Line lengths include the line terminator character(s), hence MIN_LINE_LENGTH >= 1. (All the better if you know a tighter bound on line length).

It is worth noting that the runtime of this algorithm does not depend on file size, only on line length, i.e. it scales much better than reading the entire file.

like image 165
meriton Avatar answered Nov 16 '22 11:11

meriton


Here's a solution. Take a look at the choose() method which does the real thing (the main() method repeatedly exercises choose(), to show that the distribution is indeed quite uniform).

The idea is simple: when you read the first line it has a 100% chance of being chosen as the result. When you read the 2nd line it has a 50% chance of replacing the first line as the result. When you read the 3rd line it has a 33% chance of becoming the result. The fourth line has a 25%, and so on....

import java.io.*;
import java.util.*;

public class B {

  public static void main(String[] args) throws FileNotFoundException {
     Map<String,Integer> map = new HashMap<String,Integer>();
     for(int i = 0; i < 1000; ++i)
     {
        String s = choose(new File("g:/temp/a.txt"));
        if(!map.containsKey(s))
           map.put(s, 0);
        map.put(s, map.get(s) + 1);
     }

     System.out.println(map);
  }

  public static String choose(File f) throws FileNotFoundException
  {
     String result = null;
     Random rand = new Random();
     int n = 0;
     for(Scanner sc = new Scanner(f); sc.hasNext(); )
     {
        ++n;
        String line = sc.nextLine();
        if(rand.nextInt(n) == 0)
           result = line;         
     }

     return result;      
  }
}
like image 37
Itay Maman Avatar answered Nov 16 '22 12:11

Itay Maman


Either you

  1. read the file twice - once to count the number of lines, the second time to extract a random line, or

  2. use reservoir sampling

like image 10
Will Avatar answered Nov 16 '22 11:11

Will


Looking over Itay's answer, it looks as though it reads the file a thousand times over after sampling one line of the code, whereas true reservoir sampling should only go over the 'tape' once. I've devised some code to go over code once with real reservoir sampling, based on this and the various descriptions on the web.

import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.List;

public class reservoirSampling {

    public static void main(String[] args) throws FileNotFoundException, IOException{
        Sampler mySampler = new Sampler();
        List<String> myList = mySampler.sampler(10);
        for(int index = 0;index<myList.size();index++){
            System.out.println(myList.get(index));
        }
    }
}

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Scanner;

public class Sampler {

    public Sampler(){}
    public List<String> sampler (int reservoirSize) throws FileNotFoundException, IOException
    {
        String currentLine=null;
        //reservoirList is where our selected lines stored
        List <String> reservoirList= new ArrayList<String>(reservoirSize); 
        // we will use this counter to count the current line number while iterating
        int count=0; 

        Random ra = new Random();
        int randomNumber = 0;
        Scanner sc = new Scanner(new File("Open_source.html")).useDelimiter("\n");
        while (sc.hasNext())
        {
            currentLine = sc.next();
            count ++;
            if (count<=reservoirSize)
            {
                reservoirList.add(currentLine);
            }
            else if ((randomNumber = (int) ra.nextInt(count))<reservoirSize)
            {
                reservoirList.set(randomNumber, currentLine);
            }
        }
        return reservoirList;
    }
}

The basic premise is that you fill up the reservoir, and then go back to it and fill in random lines with a 1/ReservoirSize chance. I hope this provides more efficient code. Please let me know if this doesn't work for you, as I've literally knocked it up in half an hour.

like image 6
AncientSwordRage Avatar answered Nov 16 '22 12:11

AncientSwordRage