I'm doing some testing with Nutch and Hadoop and I need a massive amount of data. I want to start with 20 GB, then go to 100 GB, 500 GB, and eventually reach 1-2 TB.
The problem is that I don't have this amount of data, so I'm thinking of ways to produce it.
The data itself can be of any kind. One idea is to take an initial set of data and duplicate it, but that's not good enough because I need files that are different from one another (identical files are ignored).
Another idea is to write a program that will create files with dummy data.
Any other ideas?
There are also existing sources of large datasets: the CISER Data Archive at the Cornell Institute for Social and Economic Research; a U.S. government portal where you can "find, download, and use datasets that are generated and held by the Federal Government"; and a U.S. government website with links to health-related datasets from a variety of health agencies.
This may be a better question for the statistics StackExchange site (see, for instance, my question on best practices for generating synthetic data).
However, if you're less interested in the statistical properties of the data than in the infrastructure for manipulating and working with it, you can ignore the statistics site. If you merely want "big data", we can focus on how to generate a large pile of it.
I can offer several answers:
If you are just interested in random numeric data, generate a large stream from your favorite implementation of the Mersenne Twister. There is also /dev/random (see this Wikipedia entry for more info). I prefer a known random number generator, as the results can be reproduced ad nauseam by anyone else.
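As a minimal sketch, Python's built-in `random` module is itself a Mersenne Twister implementation; the output file name and the 20 GB target below are arbitrary choices, and the fixed seed keeps the stream reproducible:

```python
import random

# Python's built-in random module is a Mersenne Twister implementation;
# a fixed seed makes the stream reproducible by anyone else.
rng = random.Random(42)

target_bytes = 20 * 1024**3              # e.g. 20 GB; raise this for larger runs
written = 0

with open("random_numbers.txt", "w") as out:
    while written < target_bytes:
        # One line of 1000 random integers, space-separated.
        line = " ".join(str(rng.randint(0, 10**9)) for _ in range(1000)) + "\n"
        out.write(line)
        written += len(line)
```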
For structured data, you can look at mapping random numbers to indices and create a table that maps indices to, say, strings, numbers, etc., such as one might encounter in producing a database of names, addresses, etc. If you have a large enough table or a sufficiently rich mapping target, you can reduce the risk of collisions (e.g. same names), though perhaps you'd like to have a few collisions, as these occur in reality, too.
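To illustrate, here is a small Python sketch that maps random indices into lookup tables to emit fake name/address records; the table contents, output file name, and record count are placeholders you would scale up for real runs:

```python
import random

rng = random.Random(7)

# Toy lookup tables; real runs would use much larger ones to reduce
# accidental collisions (though a few duplicate names are realistic).
first_names = ["Alice", "Bob", "Carol", "Dave", "Eve"]
streets = ["Main St", "Oak Ave", "Elm Rd", "Maple Dr"]

def random_record(row_id):
    # Map random indices into the tables to build one plausible row.
    name = rng.choice(first_names)
    address = f"{rng.randint(1, 9999)} {rng.choice(streets)}"
    return f"{row_id},{name},{address}\n"

with open("people.csv", "w") as out:
    out.write("id,name,address\n")
    for row_id in range(1_000_000):       # scale this count up as needed
        out.write(random_record(row_id))
```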
Keep in mind that with any generative method you need not store the entire data set before beginning your work. As long as you record the state (e.g. of the RNG), you can pick up where you left off.
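For example, with Python's `random` module you can checkpoint the generator with `getstate()` and resume later with `setstate()`; the state file name here is arbitrary:

```python
import pickle
import random

rng = random.Random(123)

# Generate a first chunk of data.
chunk1 = [rng.random() for _ in range(1000)]

# Checkpoint the generator state so a later run can continue the same stream.
with open("rng_state.pkl", "wb") as f:
    pickle.dump(rng.getstate(), f)

# Later (possibly in another process): restore the state and pick up
# exactly where the first run left off.
resumed = random.Random()
with open("rng_state.pkl", "rb") as f:
    resumed.setstate(pickle.load(f))

chunk2 = [resumed.random() for _ in range(1000)]   # continues the sequence
```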
For text data, you can look at simple random string generators. You might create your own estimates for the probability of strings of different lengths or different characteristics. The same can go for sentences, paragraphs, documents, etc. - just decide what properties you'd like to emulate, create a "blank" object, and fill it with text.
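One possible sketch in Python, using hand-picked length distributions for words, sentences, and paragraphs; you would replace these with estimates matching whatever corpus you want to emulate:

```python
import random
import string

rng = random.Random(2024)

def word():
    # Word length drawn from a rough, hand-picked distribution.
    length = rng.choice([2, 3, 4, 4, 5, 5, 6, 7, 8, 10])
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def sentence():
    return " ".join(word() for _ in range(rng.randint(5, 20))).capitalize() + "."

def paragraph():
    return " ".join(sentence() for _ in range(rng.randint(3, 8)))

def document(n_paragraphs=10):
    return "\n\n".join(paragraph() for _ in range(n_paragraphs))

# Write a batch of distinct documents.
for k in range(100):
    with open(f"doc_{k:04d}.txt", "w") as f:
        f.write(document())
```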
If you only need to avoid exact duplicates, you could try a combination of your two ideas: create corrupted copies of a relatively small data set. "Corruption" operations might include replacement, insertion, deletion, and character swapping.
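A rough Python sketch of that idea, assuming a hypothetical seed file `seed_file.txt`; the number of edits and copies are knobs you would tune until the copies differ enough not to be ignored as duplicates:

```python
import random

rng = random.Random(99)
ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def corrupt(text, n_edits=50):
    """Apply random replacement, insertion, deletion, and swap edits."""
    chars = list(text)
    for _ in range(n_edits):
        if not chars:
            break
        op = rng.choice(["replace", "insert", "delete", "swap"])
        i = rng.randrange(len(chars))
        if op == "replace":
            chars[i] = rng.choice(ALPHABET)
        elif op == "insert":
            chars.insert(i, rng.choice(ALPHABET))
        elif op == "delete":
            del chars[i]
        elif op == "swap" and i + 1 < len(chars):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

with open("seed_file.txt") as f:          # hypothetical seed file
    original = f.read()

for k in range(100):                      # 100 distinct corrupted copies
    with open(f"copy_{k:04d}.txt", "w") as out:
        out.write(corrupt(original, n_edits=50))
```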