I would like to randomize the lines in a file which has over 32 million lines of 10 digit strings. I am aware of how to do it with File.ReadAllLines(...).OrderBy(s => random.Next()).ToArray()
but this is not memory efficient since it will load everything into memory (over 1.4GB) and works only with x64 architecture.
The alternative would be to split it and randomize the shorter files and then merge them but I was wondering if there is a better way to do this.
Using the shuf Command The shuf utility is a member of the GNU Coreutils package. It outputs a random permutation of the input lines. The shuf command will load all input data into memory during the shuffling, and it won't work if the input file is larger than the free memory.
Your current approach will allocate at least 2 large string arrays (probably more -- I don't know how OrderBy is implemented, but it probably does its own allocations).
If you randomize the data "in-place" by doing random permutations between the lines (e.g. using the Fisher-Yates shuffle), it will minimize the memory usage. Of course it will still be large if the file is large, but you won't allocate more memory than necessary.
EDIT: if all lines have the same length (*), it means you can do random access to a given line in the file, so you can do the Fisher-Yates shuffle directly in the file.
(*) and assuming you're not using an encoding where characters can have different byte lengths, like UTF-8
This application demonstrates what you want, using a byte array
Peak memory usage is about 400 MB. Runs in about 20 seconds on my machine (mostly file IO).
public class Program
{
private static Random random = new Random();
public static void Main(string[] args)
{
// create massive file
var random = new Random();
const int lineCount = 32000000;
var file = File.CreateText("BigFile.txt");
for (var i = 0; i < lineCount ; i++)
{
file.WriteLine("{0}",i.ToString("D10"));
}
file.Close();
int sizeOfRecord = 12;
var loadedLines = File.ReadAllBytes("BigFile.txt");
ShuffleByteArray(loadedLines, lineCount, sizeOfRecord);
File.WriteAllBytes("BigFile2.txt", loadedLines);
}
private static void ShuffleByteArray(byte[] byteArray, int lineCount, int sizeOfRecord)
{
var temp = new byte[sizeOfRecord];
for (int i = lineCount - 1; i > 0; i--)
{
int j = random.Next(0, i + 1);
// copy i to temp
Buffer.BlockCopy(byteArray, sizeOfRecord * i, temp, 0, sizeOfRecord);
// copy j to i
Buffer.BlockCopy(byteArray, sizeOfRecord * j, byteArray, sizeOfRecord * i, sizeOfRecord);
// copy temp to j
Buffer.BlockCopy(temp, 0, byteArray, sizeOfRecord * j, sizeOfRecord);
}
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With