How can I quickly create large (>1gb) text+binary files with "natural" content? (C#)

Tags: c#, .net, filesystems, windows, testing

For purposes of testing compression, I need to be able to create large files, ideally in text, binary, and mixed formats.

The content of the files should be neither completely random nor uniform. A binary file with all zeros is no good. A binary file with totally random data is also no good. For text, a file with totally random sequences of ASCII is no good either: the text files should have patterns and frequencies that simulate natural language or source code (XML, C#, etc.). Pseudo-real text.
The size of each individual file is not critical, but for the set of files, I need the total to be ~8 GB.
I'd like to keep the number of files at a manageable level, let's say O(10).

For creating binary files, I can allocate a large buffer and call System.Random.NextBytes followed by FileStream.Write in a loop, like this:

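// size, sz, Filename, zeroes and _rnd are defined elsewhere in the calling code.
Int64 bytesRemaining = size;
byte[] buffer = new byte[sz];
using (Stream fileStream = new FileStream(Filename, FileMode.Create, FileAccess.Write))
{
    while (bytesRemaining > 0)
    {
        // Write a full buffer, or only what is left on the final pass.
        int sizeOfChunkToWrite = (int)Math.Min(bytesRemaining, buffer.Length);
        // Refill with random bytes unless an all-zero file was requested.
        if (!zeroes) _rnd.NextBytes(buffer);
        fileStream.Write(buffer, 0, sizeOfChunkToWrite);
        bytesRemaining -= sizeOfChunkToWrite;
    }
    // No explicit Close() is needed: the using block disposes the stream.
}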

With a large enough buffer, let's say 512k, this is relatively fast, even for files over 2 or 3gb. But the content is totally random, which is not what I want.
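One possible way to get binary data that is neither uniform nor pure noise is to mix random stretches, runs of a single byte, and copies of earlier data in the same buffer. The following is only a sketch under assumptions: the class name `SemiRandomBytes`, the method `Fill`, and the chunk-length range are all made up for illustration, and how closely the result resembles real binary files is untested.

```csharp
using System;

public static class SemiRandomBytes
{
    // Fills buffer with a mix of random stretches, single-byte runs, and
    // copies of earlier data, so the result is neither uniform nor pure noise.
    // The chunk length range (16..4095) is an arbitrary choice.
    public static void Fill(byte[] buffer, Random rnd)
    {
        int pos = 0;
        while (pos < buffer.Length)
        {
            int len = Math.Min(rnd.Next(16, 4096), buffer.Length - pos);
            int kind = rnd.Next(3);
            if (kind == 0 || pos == 0)
            {
                // Random stretch: essentially incompressible.
                var chunk = new byte[len];
                rnd.NextBytes(chunk);
                Buffer.BlockCopy(chunk, 0, buffer, pos, len);
            }
            else if (kind == 1)
            {
                // Run of one repeated byte value: highly compressible.
                byte value = (byte)rnd.Next(256);
                for (int i = 0; i < len; i++) buffer[pos + i] = value;
            }
            else
            {
                // Copy of earlier data, byte by byte, so overlapping ranges
                // repeat the way an LZ-style back-reference would.
                int src = rnd.Next(pos);
                for (int i = 0; i < len; i++) buffer[pos + i] = buffer[src + i];
            }
            pos += len;
        }
    }
}
```

In the loop above, the call `_rnd.NextBytes(buffer)` could be swapped for `SemiRandomBytes.Fill(buffer, _rnd)`. The mix of the three chunk kinds is the knob to turn if the output compresses too well or too poorly.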

For text files, the approach I have taken is to use Lorem Ipsum, and repeatedly emit it via a StreamWriter into a text file. The content is non-random and non-uniform, but it does have many identical repeated blocks, which is unnatural. Also, because the Lorem Ipsum block is so small (<1k), it takes many loops and a very, very long time.

Neither of these is quite satisfactory for me.

I have seen the answers to "Quickly create large file on a windows system?". Those approaches are very fast, but I think they just fill the file with zeroes or random data, neither of which is what I want. I have no problem with running an external process like contig or fsutil, if necessary.

The tests run on Windows.
Rather than create new files, does it make more sense to just use files that already exist in the filesystem? I don't know of any that are sufficiently large.

What about starting with a single existing file (maybe c:\windows\Microsoft.NET\Framework\v2.0.50727\Config\enterprisesec.config.cch for a text file) and replicating its content many times? This would work with either a text or binary file.
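The replication idea could be sketched roughly as follows. This is a minimal illustration, not a tested solution: the class name `SeedFileReplicator`, the method `Replicate`, and the 1 MB stream buffer are assumptions, and it assumes the seed file fits in memory.

```csharp
using System;
using System.IO;

public static class SeedFileReplicator
{
    // Writes the bytes of seedPath to outputPath over and over until exactly
    // targetBytes have been written (the final copy may be cut short).
    public static void Replicate(string seedPath, string outputPath, long targetBytes)
    {
        byte[] seed = File.ReadAllBytes(seedPath); // assumes the seed fits in memory
        if (seed.Length == 0)
            throw new ArgumentException("Seed file is empty.", nameof(seedPath));

        // A large FileStream buffer (1 MB here, an arbitrary choice) batches
        // many small writes into fewer calls to the operating system.
        using (var output = new FileStream(outputPath, FileMode.Create, FileAccess.Write,
                                           FileShare.None, 1 << 20))
        {
            long remaining = targetBytes;
            while (remaining > 0)
            {
                int count = (int)Math.Min(seed.Length, remaining);
                output.Write(seed, 0, count);
                remaining -= count;
            }
        }
    }
}
```

A call such as `SeedFileReplicator.Replicate(seedPath, outputPath, 1L << 30)` would produce a 1 GB file. Because the seed is read once and written as raw bytes, no text encoding happens per iteration; the trade-off is the same one noted for Lorem Ipsum, namely that the output consists of identical repeated blocks.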

Currently I have an approach that sort of works but it takes too long to run.

Has anyone else solved this?

Is there a much faster way to write a text file than via StreamWriter?

Suggestions?

EDIT: I like the idea of a Markov chain to produce a more natural text. Still need to confront the issue of speed, though.
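A word-level Markov chain along those lines might look like the sketch below. Everything here is hypothetical: the names `MarkovTextFile`, `BuildModel`, and `Write`, the first-order (one-word) context, and the 64 KB writer buffer are illustrative choices, and the speed on multi-gigabyte output has not been measured.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class MarkovTextFile
{
    // Builds a first-order, word-level model: each word maps to the list of
    // words that followed it in the sample text (duplicates keep frequencies).
    public static Dictionary<string, List<string>> BuildModel(string sampleText)
    {
        string[] words = sampleText.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var model = new Dictionary<string, List<string>>();
        for (int i = 0; i + 1 < words.Length; i++)
        {
            if (!model.TryGetValue(words[i], out List<string> followers))
            {
                followers = new List<string>();
                model[words[i]] = followers;
            }
            followers.Add(words[i + 1]);
        }
        return model;
    }

    // Writes generated words, separated by spaces, until at least
    // targetBytes bytes of UTF-8 text have been written.
    public static void Write(string path, Dictionary<string, List<string>> model,
                             long targetBytes, int seed)
    {
        if (model.Count == 0)
            throw new ArgumentException("Model is empty.", nameof(model));

        var rnd = new Random(seed);
        var starts = new List<string>(model.Keys);
        string current = starts[rnd.Next(starts.Count)];
        long written = 0;

        // UTF-8 without a byte order mark, with a 64 KB writer buffer.
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(false), 1 << 16))
        {
            while (written < targetBytes)
            {
                writer.Write(current);
                writer.Write(' ');
                written += Encoding.UTF8.GetByteCount(current) + 1;

                if (model.TryGetValue(current, out List<string> followers))
                    current = followers[rnd.Next(followers.Count)];
                else
                    current = starts[rnd.Next(starts.Count)]; // dead end: restart
            }
        }
    }
}
```

The model would be built once from any sample text, for example `MarkovTextFile.BuildModel(File.ReadAllText(samplePath))`, and then reused for each output file with a different seed. A longer context (two or three words) should give more natural text at the cost of a larger model.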

asked Jun 24 '09 11:06

Cheeso

1 Answer

For text, you could use the Stack Overflow community data dump; there is about 300 MB of data there. It takes only about 6 minutes to load into a database with the app I wrote, and probably about the same time to dump all the posts to text files. That would easily give you anywhere from 200K to 1 million text files, depending on your approach (with the added bonus of having source code and XML mixed in).

You could also use something like the Wikipedia dump; it seems to ship in MySQL format, which would make it super easy to work with.

If you are looking for a big file that you can split up, for binary purposes, you could use either a VM disk image (vmdk) or a DVD ripped locally.

Edit

Mark mentions the Project Gutenberg download; this is also a really good source for text (and audio), and it is available for download via BitTorrent.

answered Sep 18 '22 20:09

Sam Saffron
