
Generate Random Data for Cassandra DB

I have a big data project for school that requires us to build and query an 8-node Cassandra system. The system must contain at least seven terabytes of data, and I have to generate all of it myself. The data does not need to be "relevant" to the assignment, i.e. each column can just be a random int. That said, each value must be random or based on a random sequence.

So I wrote a simple Java program that just generates random ints. It produces ~200 MB of random test data in ~120 s. Unless my math is off, I think I'm in a pickle.

There are 35,000 units of 200 MB in 7 terabytes.

35,000 * 120 = 4,200,000 seconds

4,200,000 / 3600 ≈ 1167 hours

1167 / 24 ≈ 49 days

So, it appears that it will take 49 days to generate all the test data needed. Obviously, this is impractical. I'm looking for suggestions that will increase the rate at which I can generate data.
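
As a sanity check on the arithmetic above, here is a small Java sketch of the same estimate. It assumes decimal units (1 TB = 1,000,000 MB) and a constant 120 s per 200 MB unit; the class and method names are made up for illustration, and the 8-node figure assumes the work splits perfectly evenly, which is optimistic.

```java
// Back-of-the-envelope check of the generation-time estimate.
final class GenerationEstimate {
    // Assumption: decimal units, 1 TB = 1,000,000 MB.
    static final long MB_PER_TB = 1_000_000L;

    /** Number of fixed-size units needed to reach the target volume. */
    static long unitsNeeded(long targetTb, long unitMb) {
        return targetTb * MB_PER_TB / unitMb;
    }

    /** Total seconds if each unit takes secondsPerUnit on a single machine. */
    static long totalSeconds(long units, long secondsPerUnit) {
        return units * secondsPerUnit;
    }

    public static void main(String[] args) {
        long units = unitsNeeded(7, 200);
        long seconds = totalSeconds(units, 120);
        double hours = seconds / 3600.0;
        double days = hours / 24.0;
        System.out.printf("%d units, %d s, %.0f h, %.1f days%n", units, seconds, hours, days);
        // Idealised assumption: the work divides evenly over 8 nodes.
        System.out.printf("%.1f days if split across 8 nodes%n", days / 8);
    }
}
```

Even with a perfect 8-way split, the estimate stays at several days, so the per-unit generation speed still matters.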

Options I'm considering:

Setting the replication factor to 8 to reduce the amount of data that needs to be generated, and also running the data generation program on all 8 nodes.

Edit: how I'm generating the data:

private void initializeCols(){
    cols = new ArrayList<Generator>();
    cols.add(new IntGenerator(400));
}

public ArrayList<String> generatePage(){
    ArrayList<String> page = new ArrayList<String>();
    String line = "";
    for(int i = 0; i < PAGE_SIZE; i++){
        line = "";
        for(Generator column : cols){
            line += column.gen();
        }
        page.add(line);
    }
    return page;
}

Originally I was generating more test-specific data such as phone numbers, but then I decided to generate only random ints to shave some time off. The savings were small. Here is the IntGenerator class.

public IntGenerator(int series){
    this.series = series;
}

public String gen(){
    String output = "";

    for(int i = 0; i < series; i++){
        output += Integer.toString(randomInt(1,1000));
        output += SEPERATOR; 
    }
    return output;
}

slmyers asked Feb 11 '26 03:02


2 Answers

Use cassandra-stress 2.1.

And use this tool to generate your YAML.

You'll have random data in C* in minutes, no coding!

phact answered Feb 15 '26 05:02


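Since you are doing a lot of string concatenation inside loops, I highly recommend using StringBuilder. Each += on a String allocates a new String and copies the existing contents, so the cost grows with every iteration. StringBuilder appends into a single growing buffer instead and should make your loops dramatically faster. For example: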

public String gen(){
    StringBuilder sb = new StringBuilder();
    for(int i = 0; i < series; i++){
        sb.append(Integer.toString(randomInt(1,1000)));
        sb.append(SEPERATOR); 
    }
    return sb.toString();
}

You should do something similar in your generatePage method as well.
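
A minimal sketch of what that could look like for generatePage, reusing one StringBuilder for every line. The question does not show the Generator type, the randomInt helper, or the values of SEPERATOR and PAGE_SIZE, so the versions below are assumptions made only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

final class PageSketch {
    // Hypothetical values; the question does not show these constants.
    static final String SEPERATOR = ",";
    static final int PAGE_SIZE = 100;

    /** Assumed shape of the question's Generator type. */
    interface Generator {
        String gen();
    }

    /** IntGenerator along the lines of the answer, with an assumed randomInt helper. */
    static final class IntGenerator implements Generator {
        private final int series;

        IntGenerator(int series) {
            this.series = series;
        }

        // Assumption: randomInt(min, max) is inclusive on both ends.
        private static int randomInt(int min, int max) {
            return ThreadLocalRandom.current().nextInt(min, max + 1);
        }

        @Override
        public String gen() {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < series; i++) {
                sb.append(randomInt(1, 1000)).append(SEPERATOR);
            }
            return sb.toString();
        }
    }

    /** Builds one page, reusing a single StringBuilder for every line. */
    static List<String> generatePage(List<Generator> cols) {
        List<String> page = new ArrayList<>(PAGE_SIZE);
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < PAGE_SIZE; i++) {
            line.setLength(0); // clear the buffer without reallocating it
            for (Generator column : cols) {
                line.append(column.gen());
            }
            page.add(line.toString());
        }
        return page;
    }

    public static void main(String[] args) {
        List<Generator> cols = new ArrayList<>();
        cols.add(new IntGenerator(400));
        List<String> page = generatePage(cols);
        System.out.println(page.size() + " lines generated");
    }
}
```

If the pages are written straight to a file, it may be worth skipping the intermediate list and appending each line to a buffered writer instead, but that depends on how the rest of the program consumes the page.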

Andrew Alderson answered Feb 15 '26 06:02