I have a big data project for school that requires us to build and query an 8-node Cassandra system. The system must contain at least seven terabytes of data, and I have to generate all of it myself. There is no requirement that the data be "relevant" to the assignment, i.e. each column can just be a random int. That said, each value must be random or based on a random sequence.
So, I wrote a simple Java program to just generate random ints. I can generate ~200 MB of random test data in ~120 s. Unless my math is off, I think I'm in a pickle.
There are 35,000 units of 200 MB in 7 terabytes.
35,000 * 120 = 4,200,000 seconds
4,200,000 / 3600 ≈ 1167 hours
1167 / 24 ≈ 49 days
So, it appears that it will take 49 days to generate all the test data needed. Obviously, this is impractical. I'm looking for suggestions that will increase the rate at which I can generate data.
So far I've considered:
Setting the replication factor to 8 to reduce the amount of data that needs to be generated, and also running the data generation program on all 8 nodes.
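As a sanity check on the arithmetic above, here is a minimal sketch that reproduces the estimate and shows how the two ideas would change it. The class and method names are made up for illustration. It assumes decimal units (7 TB = 7,000,000 MB), perfectly linear scaling across machines, and that replicas count toward the 7 TB requirement, which the assignment may or may not allow.

```java
public class GenerationEstimate {
    // Measured rate from the post: ~200 MB generated in ~120 s on one machine.
    static final double UNIT_MB = 200.0;
    static final double UNIT_SECONDS = 120.0;

    /** Days needed to generate totalMb when `generators` machines run in parallel. */
    static double daysToGenerate(double totalMb, int generators) {
        double seconds = (totalMb / UNIT_MB) * UNIT_SECONDS / generators;
        return seconds / 86_400.0; // seconds per day
    }

    public static void main(String[] args) {
        double totalMb = 7_000_000.0; // 7 TB in decimal units

        // Baseline: one machine generating everything.
        System.out.printf("1 generator: %.1f days%n", daysToGenerate(totalMb, 1));

        // Running the generator on all 8 nodes (assumes linear scaling).
        System.out.printf("8 generators: %.1f days%n", daysToGenerate(totalMb, 8));

        // Assumption: with replication factor 8, only 1/8 of the data is unique.
        System.out.printf("8 generators, RF=8: %.1f days%n",
                daysToGenerate(totalMb / 8, 8));
    }
}
```

Under these assumptions the baseline comes out at roughly 48.6 days, 8 parallel generators bring it to about 6 days, and combining that with a replication factor of 8 brings it under one day. This covers generation only; the time to load the data into Cassandra is not included.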
Edit: here is how I'm generating the data.
private void initializeCols(){
    cols = new ArrayList<Generator>();
    cols.add(new IntGenerator(400));
}
public ArrayList<String> generatePage(){
    ArrayList<String> page = new ArrayList<String>();
    String line = "";
    for(int i = 0; i < PAGE_SIZE; i++){
        line = "";
        for(Generator column : cols){
            line += column.gen();
        }
        page.add(line);
    }
    return page;
}
Originally I was generating more test-specific data such as phone numbers, but then I decided to just generate random ints to shave some time off. The savings were small. Here is the IntGenerator class.
public IntGenerator(int series){
    this.series = series;
}
public String gen(){
    String output = "";
    for(int i = 0; i < series; i++){
        output += Integer.toString(randomInt(1,1000));
        output += SEPERATOR;
    }
    return output;
}
Use cassandra-stress 2.1.
And use this tool to generate your YAML.
You'll have random data in C* in minutes, no coding!
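Since you are doing a lot of string concatenation in loops, I highly recommend you check out StringBuilder. Every += on a String copies everything built so far into a new String, so the cost of the loop grows quadratically with the length of the output. StringBuilder appends into a single growing buffer instead, which should dramatically speed up your loops. For example: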
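public String gen(){
    // One growing buffer instead of a new String per concatenation.
    StringBuilder sb = new StringBuilder();
    for(int i = 0; i < series; i++){
        // append() converts the number itself, so Integer.toString is not needed.
        sb.append(randomInt(1,1000));
        sb.append(SEPERATOR);
    }
    return sb.toString();
}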
You should do the same in your generatePage method.
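A minimal sketch of what that could look like for generatePage, wrapped in a small self-contained class so it can be run on its own. The question does not show Generator, randomInt, PAGE_SIZE or SEPERATOR, so the versions here are assumptions: Generator is taken to be an interface with a single gen() method, randomInt is replaced by ThreadLocalRandom with an inclusive 1..1000 range, and the constant values are placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class PageGenerator {
    // Placeholder values; the original post does not show these constants.
    static final int PAGE_SIZE = 1000;
    static final String SEPERATOR = ","; // spelling kept from the original code

    /** Assumed shape of the Generator type used in the question. */
    interface Generator {
        String gen();
    }

    /** Produces `series` random ints, each followed by the separator. */
    static class IntGenerator implements Generator {
        private final int series;

        IntGenerator(int series) {
            this.series = series;
        }

        @Override
        public String gen() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < series; i++) {
                // Stand-in for randomInt(1, 1000); the upper bound is exclusive.
                sb.append(ThreadLocalRandom.current().nextInt(1, 1001));
                sb.append(SEPERATOR);
            }
            return sb.toString();
        }
    }

    private final List<Generator> cols = new ArrayList<>();

    public PageGenerator() {
        cols.add(new IntGenerator(400));
    }

    public List<String> generatePage() {
        List<String> page = new ArrayList<>(PAGE_SIZE);
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < PAGE_SIZE; i++) {
            line.setLength(0); // reuse the same buffer for every line
            for (Generator column : cols) {
                line.append(column.gen());
            }
            page.add(line.toString());
        }
        return page;
    }

    public static void main(String[] args) {
        List<String> page = new PageGenerator().generatePage();
        System.out.println("lines generated: " + page.size());
    }
}
```

With a single column per line, most of the gain comes from gen() rather than generatePage; the change in generatePage matters more as the number of columns grows.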