
Randomly select elements from a big text file

Tags: java, file, random

I have a big file containing 30 million user IDs, one per line. It looks something like this:

149905320
1165665384
66969324
886633368
1145241312
286585320
1008665352
1135545396
186217320
132577356

Now I want to pick random lines from that big text file. I know the total number of user IDs in it, but I am not sure what the best way to select random elements is. I considered loading all 30 million user IDs into a HashSet and picking randomly from it, but that approach fails with an out-of-memory error.

That is why I am trying to select random elements directly from the file.

final String id = generateRandomUserId(random);

/**
 * Selects a random user ID from a big text file.
 *
 * @param r source of randomness
 * @return a randomly chosen user ID
 */
private String generateRandomUserId(Random r) {

     File bigFile = new File("C:\\bigfile.txt");

     //randomly select elements from a big text file         


}

What is the best way to do this?

AKIWEB asked Jun 20 '13 00:06


2 Answers

You could do it like this:

  • Open the file with RandomAccessFile and get its size in bytes (file.length())
  • Pick a random byte offset in [0, file.length())
  • Seek to that position in the file (file.seek(offset))
  • Skip to the position right after the next \n character (a first file.readLine() discards the partial line)
  • Read the next line (file.readLine()); a null result means you landed in the last line, so pick again

This way you don't have to store anything.

A rough sketch could look like this (caveats: it never returns the first line, and it finds nothing when the offset lands in the last line):

File f = new File("D:/abc.txt");
try (RandomAccessFile file = new RandomAccessFile(f, "r")) {
    long fileSize = file.length();
    // Random byte offset in [0, fileSize)
    long chosenByte = (long) (Math.random() * fileSize);

    file.seek(chosenByte);

    // Discard the rest of the line we landed in;
    // readLine() handles \n, \r and \r\n terminators
    file.readLine();

    // Read the next complete line; null means we landed in the last line
    String s = file.readLine();
    if (s != null && !s.isEmpty()) {
        long chosen = Long.parseLong(s.trim());
        System.out.println("Chosen id : \"" + chosen + "\"");
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}


EDIT : Full working (theoretically) class

import java.io.IOException;
import java.io.RandomAccessFile;


public class Main {

    /**
     * Picks a random byte in the file and prints the ID on the line containing it.
     * Works with \n and \r\n line endings; a trailing newline is optional.
     * Note: a line's chance of being picked grows with its length in bytes.
     *
     * @param args unused
     * @throws IOException if the file cannot be read or the chosen line is empty
     */
    public static void main(String[] args) throws IOException {

        try (RandomAccessFile file = new RandomAccessFile("D:/abc.txt", "r")) {
            long fileSize = file.length();
            if (fileSize == 0) throw new IOException("File is empty");

            // Let's start: pick a random byte offset in [0, fileSize)
            long curByte = (long) (Math.random() * fileSize);

            // If we landed on a line terminator, step back onto the line it ends
            while (curByte > 0 && isLineBreak(file, curByte)) --curByte;

            // Walk left until the previous byte is a line terminator (or the file starts)
            while (curByte > 0 && !isLineBreak(file, curByte - 1)) --curByte;

            // Read the whole line from its first byte
            file.seek(curByte);
            String line = file.readLine();

            // Parse ID
            if (line != null && !line.isEmpty())
            {
                long chosenId = Long.parseLong(line.trim());
                System.out.println("Chosen id : " + chosenId);
            }
            else
            {
                throw new IOException("Landed on an empty line");
            }
        }
    }

    /** Returns true if the byte at pos is '\n' or '\r'. */
    private static boolean isLineBreak(RandomAccessFile file, long pos) throws IOException {
        file.seek(pos);
        int b = file.read();
        return b == '\n' || b == '\r';
    }

}


Hope this is not too wrong as an implementation (I'm more of a C++ person these days)...

Gauthier Boaglio answered Nov 18 '22 05:11


Instead of storing the user IDs in a hash set, you could scan the file once and store just the byte offset of each line in an int[] array. 30M offsets would take ~120MB of RAM (int offsets are enough as long as the file is smaller than 2 GB).
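
A minimal sketch of the offset idea, assuming the file is smaller than 2 GB, contains single-byte characters, and ends lines with \n or \r\n. The class name LineIndex and its methods are made up for illustration:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper: records the start offset of every line once,
// then serves uniformly random lines with a single seek each.
public class LineIndex {
    private final String path;
    private final int[] offsets; // offsets[i] = byte position where line i starts

    public LineIndex(String path) throws IOException {
        this.path = path;
        int[] buf = new int[1024];
        int count = 0;
        long pos = 0;
        boolean atLineStart = true;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (pos > Integer.MAX_VALUE) throw new IOException("File too large for int offsets");
                    if (count == buf.length) buf = Arrays.copyOf(buf, count * 2); // grow as needed
                    buf[count++] = (int) pos;
                    atLineStart = false;
                }
                if (b == '\n') atLineStart = true; // the next byte, if any, starts a new line
                pos++;
            }
        }
        offsets = Arrays.copyOf(buf, count); // trim to the real line count
    }

    public int size() {
        return offsets.length;
    }

    public String randomLine(Random r) throws IOException {
        if (offsets.length == 0) throw new IOException("File has no lines");
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offsets[r.nextInt(offsets.length)]);
            return file.readLine(); // strips the \n or \r\n terminator
        }
    }
}
```

Building the index reads the whole file once; after that, every line is equally likely regardless of its length. Since the line count is known up front, the array could also be allocated at its final size, and a long-lived RandomAccessFile would avoid reopening the file on every call.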

Alternatively, if you can change or preprocess the file, you could convert it to a fixed-width format by padding the user IDs, or use a binary format.
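
A rough sketch of the fixed-width variant, assuming IDs are non-negative and at most 10 digits long. The class FixedWidthIds, the constant WIDTH, and both method names are hypothetical:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Random;

// Hypothetical helper: every record is WIDTH digits followed by '\n'.
public class FixedWidthIds {
    static final int WIDTH = 10;             // assumed maximum ID length in digits
    static final int RECORD_LEN = WIDTH + 1; // digits plus the newline

    // One-time preprocessing: left-pad every ID with zeros to WIDTH digits.
    static void convert(String in, String out) throws IOException {
        try (BufferedReader r = Files.newBufferedReader(Paths.get(in), StandardCharsets.US_ASCII);
             BufferedWriter w = Files.newBufferedWriter(Paths.get(out), StandardCharsets.US_ASCII)) {
            String line;
            while ((line = r.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue; // skip blank lines
                if (line.length() > WIDTH) throw new IOException("ID longer than WIDTH: " + line);
                w.write(String.format("%0" + WIDTH + "d", Long.parseLong(line)));
                w.write('\n');
            }
        }
    }

    // Record i starts at byte i * RECORD_LEN, so a random pick is one seek and one read.
    static long randomId(String fixedFile, Random r) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(fixedFile, "r")) {
            long records = file.length() / RECORD_LEN;
            if (records == 0) throw new IOException("No records");
            long index = (long) (r.nextDouble() * records); // in [0, records)
            file.seek(index * RECORD_LEN);
            byte[] buf = new byte[WIDTH];
            file.readFully(buf);
            return Long.parseLong(new String(buf, StandardCharsets.US_ASCII));
        }
    }
}
```

This needs no index in memory at all: the record count falls out of the file length, and each pick is uniform over records.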

gordy answered Nov 18 '22 05:11