 

How to produce a massive amount of data?

I'm doing some testing with Nutch and Hadoop and I need a massive amount of data. I want to start with 20 GB, go to 100 GB, then 500 GB, and eventually reach 1-2 TB.

The problem is that I don't have this amount of data, so I'm thinking of ways to produce it.

The data itself can be of any kind. One idea is to take an initial set of data and duplicate it, but that's not good enough because I need files that differ from one another (identical files are ignored).

Another idea is to write a program that will create files with dummy data.

Any other ideas?

asked Dec 29 '11 by AAaa



2 Answers

This may be a better question for the statistics StackExchange site (see, for instance, my question on best practices for generating synthetic data).

However, if you're not so much interested in the data's properties as in the infrastructure to manipulate and work with it, then you can ignore the statistics site. In particular, if you merely want "big data" and aren't focused on its statistical aspects, then we can concentrate on how to generate a large pile of data.

I can offer several answers:

  1. If you are just interested in random numeric data, generate a large stream from your favorite implementation of the Mersenne Twister. There is also /dev/random (see the Wikipedia article on /dev/random for more info). I prefer a known random number generator, as the results can be reproduced ad nauseam by anyone else. (A sketch combining points 1-4 follows this list.)

  2. For structured data, you can look at mapping random numbers to indices and create a table that maps indices to, say, strings, numbers, etc., such as one might encounter in producing a database of names, addresses, etc. If you have a large enough table or a sufficiently rich mapping target, you can reduce the risk of collisions (e.g. same names), though perhaps you'd like to have a few collisions, as these occur in reality, too.

  3. Keep in mind that with any generative method you need not store the entire data set before beginning your work. As long as you record the state (e.g. of the RNG), you can pick up where you left off.

  4. For text data, you can look at simple random string generators. You might create your own estimates for the probability of strings of different lengths or different characteristics. The same goes for sentences, paragraphs, documents, etc. - just decide what properties you'd like to emulate, create a "blank" object, and fill it with text.
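
To make points 1-4 concrete, here is a minimal Python sketch that streams reproducible records to disk: a seeded generator (so anyone can reproduce the output, per point 1), a small index-to-string mapping for structured fields (point 2), and throwaway random text (point 4). The field lists, file name, seed, and size target are purely illustrative assumptions, not anything specific to Nutch or Hadoop.

```python
import random
import string

SEED = 42  # fixed seed so the data set can be reproduced exactly

# Illustrative mapping targets for "structured" fields (point 2)
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave"]
CITIES = ["Springfield", "Riverton", "Lakeside"]

def random_word(rng, min_len=3, max_len=10):
    """A random lowercase 'word' of random length (point 4)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

def random_sentence(rng, min_words=5, max_words=15):
    """A random 'sentence' built from random words."""
    words = [random_word(rng) for _ in range(rng.randint(min_words, max_words))]
    return " ".join(words).capitalize() + "."

def generate(path, target_bytes):
    rng = random.Random(SEED)  # Mersenne Twister under the hood in CPython (point 1)
    written = 0
    with open(path, "w") as out:
        while written < target_bytes:
            record = "{}\t{}\t{}\t{}\n".format(
                rng.randint(0, 2**32),    # random numeric field (point 1)
                rng.choice(FIRST_NAMES),  # index -> string mapping (point 2)
                rng.choice(CITIES),
                random_sentence(rng),     # free-text field (point 4)
            )
            out.write(record)
            written += len(record)

if __name__ == "__main__":
    generate("synthetic.tsv", 20 * 1024**3)  # ~20 GB; raise this for larger runs
```

Because everything is driven by a fixed seed and generated sequentially, you can also persist the generator state with `rng.getstate()` / `rng.setstate()` and resume generation later (point 3) instead of holding the whole data set at once.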

answered Oct 05 '22 by Iterator


If you only need to avoid exact duplicates, you could try a combination of your two ideas: create corrupted copies of a relatively small data set. "Corruption" operations might include: replacement, insertion, deletion, and character swapping.
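
A rough sketch of that idea in Python, assuming plain-text source files; the mutation rate, seed, and output file naming are arbitrary choices for illustration:

```python
import random

def corrupt(text, rng, rate=0.01):
    """Apply random character-level edits: replace, insert, delete, or swap."""
    chars = list(text)
    i = 0
    while i < len(chars):
        if rng.random() < rate:
            op = rng.choice(("replace", "insert", "delete", "swap"))
            c = chr(rng.randint(32, 126))  # random printable ASCII character
            if op == "replace":
                chars[i] = c
            elif op == "insert":
                chars.insert(i, c)
                i += 1  # skip past the inserted character
            elif op == "delete":
                del chars[i]
                continue  # next character has shifted into position i
            elif op == "swap" and i + 1 < len(chars):
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                i += 1  # skip past the swapped pair
        i += 1
    return "".join(chars)

def make_copies(src, n_copies, seed=0):
    """Write n_copies mutated (hence non-identical) copies of the source file."""
    rng = random.Random(seed)
    with open(src) as f:
        original = f.read()
    for k in range(n_copies):
        with open("{}.copy{}".format(src, k), "w") as out:
            out.write(corrupt(original, rng))

if __name__ == "__main__":
    make_copies("seed_document.txt", 100)  # 100 distinct corrupted copies
```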

answered Oct 05 '22 by jrennie