How can I quickly create large (>1gb) text+binary files with "natural" content? (C#)

For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.

  • The content of the files should be neither completely random nor uniform.
    A binary file with all zeros is no good. A binary file with totally random data is also not good. For text, a file with totally random sequences of ASCII is not good - the text files should have patterns and frequencies that simulate natural language, or source code (XML, C#, etc). Pseudo-real text.
  • The size of each individual file is not critical, but for the set of files, I need the total to be ~8gb.
  • I'd like to keep the number of files at a manageable level, let's say on the order of 10.

For creating binary files, I can allocate a large buffer, fill it with System.Random.NextBytes, and write it with FileStream.Write in a loop, like this:

// size, Filename, _rnd, and zeroes are fields on the surrounding test class.
Int64 bytesRemaining = size;
byte[] buffer = new byte[512 * 1024];   // 512k chunks keep the write loop fast
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        int sizeOfChunkToWrite = (bytesRemaining > buffer.Length) ? buffer.Length : (int)bytesRemaining;
        if (!zeroes) _rnd.NextBytes(buffer);   // refill the chunk with fresh random bytes
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
}   // the using block disposes (and closes) the stream; no explicit Close() needed

With a large enough buffer, let's say 512k, this is relatively fast, even for files over 2 or 3gb. But the content is totally random, which is not what I want.

For text files, the approach I have taken is to use Lorem Ipsum and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1k), it takes many loops and a very, very long time.
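For reference, that repeat-the-same-block approach looks roughly like this (a minimal sketch; loremIpsum and targetSize here are placeholders, not the actual test code):

using System.IO;

// Sketch of the repeat-Lorem-Ipsum approach. loremIpsum stands in for the
// real ~1k passage; targetSize is the desired file size in bytes.
string loremIpsum = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. ";
long targetSize = 1L * 1024 * 1024 * 1024;   // ~1gb

long written = 0;
using (StreamWriter writer = new StreamWriter("lorem.txt"))
{
    while (written < targetSize)
    {
        writer.Write(loremIpsum);        // same small block, over and over
        written += loremIpsum.Length;    // approximate; assumes 1 byte per char
    }
}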

Neither of these is quite satisfactory for me.

I have seen the answers to "Quickly create large file on a Windows system?". Those approaches are very fast, but I think they just fill the file with zeroes or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.

The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.

What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.
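A minimal sketch of that replicate-a-seed-file idea (the seed path and target size below are just examples):

using System;
using System.IO;

// Sketch: grow a large file by appending the bytes of an existing "seed" file
// until the target size is reached. Works for either a text or a binary seed.
string seedPath = @"c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch";
string outPath = "big-sample.dat";
long targetSize = 2L * 1024 * 1024 * 1024;    // ~2gb

byte[] seed = File.ReadAllBytes(seedPath);
using (FileStream output = new FileStream(outPath, FileMode.Create, FileAccess.Write))
{
    long written = 0;
    while (written < targetSize)
    {
        int count = (int)Math.Min(seed.Length, targetSize - written);
        output.Write(seed, 0, count);
        written += count;
    }
}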

Currently I have an approach that sort of works but it takes too long to run.

Has anyone else solved this?

Is there a much faster way to write a text file than via StreamWriter?

Suggestions?

EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.
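Sketching the Markov-chain idea: build a word-level chain from a small seed passage, then walk it to emit pseudo-natural text. seed.txt and the sizes are placeholders, not a finished solution:

using System;
using System.Collections.Generic;
using System.IO;

// Sketch: word-level Markov chain. Build a table of word -> observed followers
// from a seed passage, then walk the table, picking a random successor each step.
string seedText = File.ReadAllText("seed.txt");      // any natural-language sample
long targetSize = 1L * 1024 * 1024 * 1024;           // ~1gb

string[] words = seedText.Split(new[] { ' ', '\t', '\r', '\n' },
                                StringSplitOptions.RemoveEmptyEntries);
var chain = new Dictionary<string, List<string>>();
for (int i = 0; i < words.Length - 1; i++)
{
    List<string> followers;
    if (!chain.TryGetValue(words[i], out followers))
        chain[words[i]] = followers = new List<string>();
    followers.Add(words[i + 1]);
}

var rnd = new Random();
string current = words[rnd.Next(words.Length)];
long written = 0;
using (StreamWriter writer = new StreamWriter("markov.txt"))
{
    while (written < targetSize)
    {
        writer.Write(current);
        writer.Write(' ');
        written += current.Length + 1;

        List<string> next;
        if (chain.TryGetValue(current, out next) && next.Count > 0)
            current = next[rnd.Next(next.Count)];
        else
            current = words[rnd.Next(words.Length)];  // dead end: restart at random
    }
}

The output keeps the word frequencies and short-range word order of the seed, so it should compress more like real text than repeated Lorem Ipsum, and the write loop itself is no slower than the repeat approach.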

Asked by Cheeso, Jun 24 '09

1 Answer

For text, you could use the Stack Overflow community data dump; there is about 300 MB of data there. It takes only about 6 minutes to load into a database with the app I wrote, and probably about the same time to dump all the posts to text files. That would easily give you anywhere between 200K and 1 million text files, depending on your approach (with the added bonus of having source code and XML mixed in).
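If you'd rather skip the database step, here is a rough sketch that streams post bodies straight out of the dump's posts XML and writes each one to its own file. The element and attribute names assume the dump's <row ... Body="..." /> layout and a file called posts.xml; check them against the dump you actually download:

using System.IO;
using System.Xml;

// Sketch: stream rows out of the data dump's posts XML and write each post
// body to its own text file. XmlReader keeps memory flat even for huge dumps.
Directory.CreateDirectory("posts");
int fileIndex = 0;
using (XmlReader reader = XmlReader.Create("posts.xml"))
{
    while (reader.Read())
    {
        if (reader.NodeType != XmlNodeType.Element || reader.Name != "row")
            continue;
        string body = reader.GetAttribute("Body");
        if (string.IsNullOrEmpty(body))
            continue;
        File.WriteAllText(Path.Combine("posts", fileIndex++ + ".txt"), body);
    }
}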

You could also use something like the Wikipedia dump; it seems to ship in MySQL format, which would make it very easy to work with.

If you are looking for a big file that you can split up for binary purposes, you could use either a VM's vmdk or a DVD ripped locally.

Edit

Mark mentions the Project Gutenberg download; this is also a really good source for text (and audio), and it is available for download via BitTorrent.

Answered by Sam Saffron