I'm working in a biology lab, and I have to design a database to store the results of many DNA microarray experiments.
Each experiment is composed of many microarrays (about ten on average), and each microarray contains over 5 million probes. Each probe is mapped to a particular gene id; of course, the same probe maps to the same gene_id in all experiments. The aim is to store the intensity values of each microarray so that we can quickly retrieve the intensity values of the probes of a particular gene id in a particular experiment.
In fact a simple MySQL table would be enough. It would look like this:
intensity table: | probe_id | experiment_id | microarray_id | gene_id | intensity_value |
with a composite primary key of (probe_id, experiment_id, microarray_id, gene_id).
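As DDL, that would be something like the following (just a sketch; the column types are my guesses):

CREATE TABLE intensity (
    probe_id        INT UNSIGNED NOT NULL,
    experiment_id   INT UNSIGNED NOT NULL,
    microarray_id   INT UNSIGNED NOT NULL,
    gene_id         INT UNSIGNED NOT NULL,
    intensity_value FLOAT NOT NULL,
    PRIMARY KEY (probe_id, experiment_id, microarray_id, gene_id)
) ENGINE=InnoDB;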
Here's the problem: each experiment has many microarrays, and each microarray has over 5 million probes. With 1000 experiments and 10 microarrays on average (which is a low estimate, some have hundreds), that's 1000 * 10 * 5M = 50 billion rows. I guess it would be slow, and I have absolutely no idea how to handle a MySQL table with billions of rows. Is that possible? Any tips?
I'm also curious about NoSQL databases. I have never used Cassandra, but it seems to me that it would be perfect for this task. Am I right? I can imagine a schema like this:
{
    experiment_id_1: {    <- that's a super column family?
        gene_id_1: {    <- that's a column family?
            probe_id_1: { value_microarray_1, value_microarray_2, ... },    <- that's a super column?
            probe_id_2: { value_microarray_1, value_microarray_2, ... },
            probe_id_3: { value_microarray_1, value_microarray_2, ... },
            ...
        },
        gene_id_2: {
            probe_id_1: { value_microarray_1, value_microarray_2, ... },
            probe_id_2: { value_microarray_1, value_microarray_2, ... },
            probe_id_3: { value_microarray_1, value_microarray_2, ... },
            ...
        }
    },
    experiment_id_2: {
        ...
    }
    ...
}
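In newer Cassandra terms (CQL), I suppose the equivalent would be something like this (just my guess, since I have never used it; the table and column names are my invention):

CREATE TABLE intensities (
    experiment_id int,
    gene_id       int,
    probe_id      int,
    microarray_id int,
    intensity     float,
    PRIMARY KEY ((experiment_id, gene_id), probe_id, microarray_id)
);

-- One partition per (experiment, gene), so the target query reads a single partition:
SELECT probe_id, microarray_id, intensity
FROM intensities
WHERE experiment_id = 42 AND gene_id = 1001;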
Am I right? Would it fit the Cassandra model? Would it be efficient? What do you think, NoSQL gurus? :)
Thanks.
The largest table we had was literally over a billion rows. This was on MySQL 5.0, so it's possible that things have improved since. It worked: MySQL processed the data correctly most of the time.
Can MySQL handle 100 million records? Yes, it can handle billions of records. If your tables are properly indexed, the indexes fit in memory, and your queries are written properly, it shouldn't be an issue.
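For the access pattern described in the question (all intensity values for the probes of one gene in one experiment), that means an index on the lookup columns, along these lines (a sketch reusing the question's table; the index name is made up):

ALTER TABLE intensity ADD INDEX idx_exp_gene (experiment_id, gene_id);

-- The target query can then be answered through the index:
SELECT probe_id, microarray_id, intensity_value
FROM intensity
WHERE experiment_id = 42 AND gene_id = 1001;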
Keep in mind that MySQL's maximum row size of 65,535 bytes limits how wide a row can be, not how many rows a table can hold. The limit is enforced regardless of storage engine, even though the storage engine may be capable of supporting larger rows.
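You can see the limit kick in with a deliberately over-wide table, roughly like the examples in the MySQL manual (a sketch; the column widths are chosen just to exceed the limit):

CREATE TABLE too_wide (
    a VARCHAR(10000), b VARCHAR(10000), c VARCHAR(10000),
    d VARCHAR(10000), e VARCHAR(10000), f VARCHAR(10000),
    g VARCHAR(6000)
) ENGINE=InnoDB CHARACTER SET latin1;
-- ERROR 1118 (42000): Row size too large. The maximum row size for the
-- used table type, not counting BLOBs, is 65535.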
Commercial databases are another option. Oracle Database has been offering database solutions since the 1970s, and recent versions are designed to integrate with cloud-based systems and to manage massive databases with billions of records.
I would also consider a relational database over a NoSQL database for this. With a few up-front checks, you can find out whether you will be able to handle the data.
When I have to deal with this kind of situation, I usually generate test data similar to what I expect to have in the real table and play around with different server parameters. I would also consider partitioning the table, for example over experiment_id. This splits the table into smaller subsets that can be handled within the boundaries of the existing hardware. Don't try to do this splitting by hand: MySQL can do it for you, and the table is still presented to the user as a single table, yet the machine only has to deal with the partition where the rows for a given experiment_id are stored. This results in much faster I/O, etc.
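For example, something like this (a sketch; I have reused the table from the question, moved the lookup columns to the front of the primary key, and picked the partition count arbitrarily):

CREATE TABLE intensity (
    probe_id        INT UNSIGNED NOT NULL,
    experiment_id   INT UNSIGNED NOT NULL,
    microarray_id   INT UNSIGNED NOT NULL,
    gene_id         INT UNSIGNED NOT NULL,
    intensity_value FLOAT NOT NULL,
    PRIMARY KEY (experiment_id, gene_id, probe_id, microarray_id)
) ENGINE=InnoDB
PARTITION BY HASH (experiment_id)
PARTITIONS 64;

-- A query that filters on experiment_id is pruned to a single partition:
SELECT probe_id, microarray_id, intensity_value
FROM intensity
WHERE experiment_id = 42 AND gene_id = 1001;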
I have already seen machines that easily handle tables with far more rows than you expect, but such setups have to be planned carefully, and they usually require a lot of testing/optimizing/redesigning before you can go into production with them. It is always worth the effort, though, since it is quite an interesting thing to work on.
(I got my first experience in this field working with EMBL data during my studies, and it became my passion ;))