
Billion-row MySQL table / Cassandra model (and a bit of biology :))

I'm working in a biology lab and I have to design a database to store the results of many DNA microarray experiments.

Each experiment is composed of many microarrays (about ten on average) and each microarray contains over 5 million probes. Each probe is mapped to a particular gene id; of course the same probe maps to the same gene_id in all the experiments. The aim is to store the intensity values of each microarray so that I can quickly retrieve the intensity values of the probes of a particular gene id in a particular experiment.

In fact a simple MySQL table would be enough; it would look like this:

intensity table : |probe_id|experiment_id|microarray_id|gene_id|intensity_value

With a primary key composed of (probe_id, experiment_id, microarray_id, gene_id).
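
In MySQL terms, a minimal sketch of that table could look like this (the column types are just assumptions, and the secondary index is only there to illustrate the gene-per-experiment lookup described above):

    CREATE TABLE intensity (
        probe_id        INT UNSIGNED NOT NULL,
        experiment_id   INT UNSIGNED NOT NULL,
        microarray_id   INT UNSIGNED NOT NULL,
        gene_id         INT UNSIGNED NOT NULL,
        intensity_value FLOAT NOT NULL,
        PRIMARY KEY (probe_id, experiment_id, microarray_id, gene_id),
        -- serves "probes of a gene in an experiment" lookups
        KEY idx_gene_experiment (gene_id, experiment_id)
    ) ENGINE=InnoDB;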

Here's the problem: each experiment has many microarrays, each of which has over 5 million probes. With 1000 experiments and 10 microarrays on average (which is a low estimate, some have hundreds), that's 1000 * 10 * 5M = 50 billion rows. I guess it would be slow, and I have absolutely no idea how to handle a billion-row MySQL table. Is that possible? Any tips?

I'm also curious about NoSQL databases. I have never used Cassandra, but it seems to me that it would be perfect for this task, am I right? I can imagine a schema like this:

{
experiment_id_1: { <- that's a super column family?
    gene_id_1: { <- that's a column family?
        probe_id_1: { value_microarray_1, value_microarray_2, ... }, <- that's a super column?
        probe_id_2:{ value_microarray_1, value_microarray_2, ... },
        probe_id_3:{ value_microarray_1, value_microarray_2, ... },
        ...
    },
    gene_id_2:{
        probe_id_1:{ value_microarray_1, value_microarray_2, ... },
        probe_id_2:{ value_microarray_1, value_microarray_2, ... },
        probe_id_3:{ value_microarray_1, value_microarray_2, ... },
        ...
    }
},
experiment_id_2: {
    ...
}
...
}

Am I right? Would it fit the Cassandra model? Would it be efficient? What do you think, NoSQL gurus? :)
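
(For what it's worth, in newer Cassandra versions a super-column layout like the one above would normally be expressed as a CQL table instead; a rough sketch with illustrative names and types, not a verified design:)

    CREATE TABLE intensity (
        experiment_id int,
        gene_id       int,
        probe_id      int,
        microarray_id int,
        intensity     float,
        PRIMARY KEY ((experiment_id, gene_id), probe_id, microarray_id)
    );

    -- all probe intensities for one gene in one experiment sit in one partition:
    SELECT probe_id, microarray_id, intensity
    FROM intensity
    WHERE experiment_id = 1 AND gene_id = 1;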

Thanks.

asked Apr 05 '12 at 11:04 by Pierre




1 Answer

In this case I would also consider a relational database over a NoSQL one. With a few considerations you can check whether you will be able to handle the data:

  1. What is the expected size of the table? To get a rough idea, check the size of one set of data and calculate the overall size by multiplying it by the total expected number of datasets (a back-of-the-envelope estimate is sketched right after this list).
  2. Calculate the size of your indexes.
  3. Now check whether your server is able to hold those indexes in RAM, or better, the whole table in RAM.
  4. What is the ratio between DML and SELECT operations on this table?
  5. Make sure you have strategies in place for how you will deal with common tasks like backing up, optimizing and altering this kind of table.
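
For step 1, a rough payload estimate for the 50-billion-row table from the question can be done in MySQL itself (the byte counts are assumptions: four 4-byte integer columns plus one 4-byte float per row; InnoDB row and page overhead plus the indexes come on top):

    SELECT 50e9 * (4 * 4 + 4)                AS payload_bytes, -- ~1.0e12 bytes
           50e9 * (4 * 4 + 4) / POW(1024, 4) AS payload_tib;   -- roughly 0.9 TiB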

If I have to deal with this kind of situation, I usually generate some test data similar to what I expect in my table and play around with different server parameters. In this case I would also consider partitioning the table (for example, partitioning on experiment_id; a minimal sketch follows below). This splits the table into smaller subsets that can be handled within the boundaries of your existing hardware. You don't have to do this splitting yourself: MySQL can do it for you, and the table is still presented to the user as a single table, but the machine only has to deal with the partition where the rows for a given experiment_id are stored. This results in much faster I/O etc.
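
A minimal sketch of such a partitioning clause, applied to the intensity table from the question (the partition count of 64 is an arbitrary example; MySQL requires the partitioning column to be part of every unique key, which experiment_id already is here):

    ALTER TABLE intensity
        PARTITION BY HASH (experiment_id)
        PARTITIONS 64;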

I have already seen machines that easily handle tables with far more rows than you expect, but you have to plan such a setup carefully, and it usually requires a lot of testing/optimizing/redesigning before you can go into production with it. But it is always worth the effort, since it is quite an interesting thing to deal with.

(I made my first experiences in this field while dealing with EMBL data during my studies, and it became my passion ;))

answered Sep 25 '22 at 13:09 by Flo Doe