Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Randomize lines of really huge text file

I would like to randomize the lines in a file which has over 32 million lines of 10 digit strings. I am aware of how to do it with File.ReadAllLines(...).OrderBy(s => random.Next()).ToArray() but this is not memory efficient since it will load everything into memory (over 1.4GB) and works only with x64 architecture.

The alternative would be to split it and randomize the shorter files and then merge them but I was wondering if there is a better way to do this.

like image 718
idipous Avatar asked Nov 20 '13 09:11

idipous


People also ask

How do you shuffle a line in a text file in Linux?

Using the shuf Command The shuf utility is a member of the GNU Coreutils package. It outputs a random permutation of the input lines. The shuf command will load all input data into memory during the shuffling, and it won't work if the input file is larger than the free memory.


2 Answers

Your current approach will allocate at least 2 large string arrays (probably more -- I don't know how OrderBy is implemented, but it probably does its own allocations).

If you randomize the data "in-place" by doing random permutations between the lines (e.g. using the Fisher-Yates shuffle), it will minimize the memory usage. Of course it will still be large if the file is large, but you won't allocate more memory than necessary.


EDIT: if all lines have the same length (*), it means you can do random access to a given line in the file, so you can do the Fisher-Yates shuffle directly in the file.

(*) and assuming you're not using an encoding where characters can have different byte lengths, like UTF-8

like image 95
Thomas Levesque Avatar answered Oct 23 '22 10:10

Thomas Levesque


This application demonstrates what you want, using a byte array

  1. It creates a file with the padded numbers 0 to 32000000.
  2. It loads the file, then shuffles them in memory, using a block-copying Fisher-Yates method.
  3. Finally, it writes the file back out in shuffled order

Peak memory usage is about 400 MB. Runs in about 20 seconds on my machine (mostly file IO).

public class Program
{
    private static Random random = new Random();

    public static void Main(string[] args)
    {
        // create massive file
        var random = new Random();
        const int lineCount = 32000000;

        var file = File.CreateText("BigFile.txt");

        for (var i = 0; i < lineCount ; i++)
        {
            file.WriteLine("{0}",i.ToString("D10"));
        }

        file.Close();

        int sizeOfRecord = 12;

        var loadedLines = File.ReadAllBytes("BigFile.txt");

        ShuffleByteArray(loadedLines, lineCount, sizeOfRecord);

        File.WriteAllBytes("BigFile2.txt", loadedLines);
    }

    private static void ShuffleByteArray(byte[] byteArray, int lineCount, int sizeOfRecord)
    {
        var temp = new byte[sizeOfRecord];

        for (int i = lineCount - 1; i > 0; i--)
        {
            int j = random.Next(0, i + 1);
            // copy i to temp
            Buffer.BlockCopy(byteArray, sizeOfRecord * i, temp, 0, sizeOfRecord);
            // copy j to i
            Buffer.BlockCopy(byteArray, sizeOfRecord * j, byteArray, sizeOfRecord * i, sizeOfRecord);
            // copy temp to j
            Buffer.BlockCopy(temp, 0, byteArray, sizeOfRecord * j, sizeOfRecord);
        }
    }
}
like image 22
Baldrick Avatar answered Oct 23 '22 09:10

Baldrick