
Billion-row MySQL table / Cassandra model (and a bit of biology :))

I'm working in a biology lab and I have to design a database to store the results of many DNA microarray experiments.

Each experiment is composed of many microarrays (about ten on average) and each microarray contains over 5 million probes. Each probe is mapped to a particular gene id; of course the same probe maps to the same gene_id in all the experiments. The aim is to store the intensity values of each microarray so that I can quickly retrieve the intensity values of the probes of a particular gene id in a particular experiment.

In fact a simple MySQL table would be enough; it would look like this:

intensity table : |probe_id|experiment_id|microarray_id|gene_id|intensity_value

With a primary key composed of (probe_id, experiment_id, microarray_id, gene_id).
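
In MySQL terms, a minimal sketch of that table could look like this (the column types are just assumptions, and the secondary index is only there to illustrate the gene-per-experiment lookup described above):

    CREATE TABLE intensity (
        probe_id        INT UNSIGNED NOT NULL,
        experiment_id   INT UNSIGNED NOT NULL,
        microarray_id   INT UNSIGNED NOT NULL,
        gene_id         INT UNSIGNED NOT NULL,
        intensity_value FLOAT NOT NULL,
        PRIMARY KEY (probe_id, experiment_id, microarray_id, gene_id),
        -- serves "probes of a gene in an experiment" lookups
        KEY idx_gene_experiment (gene_id, experiment_id)
    ) ENGINE=InnoDB;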

Here's the problem: each experiment has many microarrays, each of which has over 5 million probes. With 1000 experiments and 10 microarrays on average (which is a low estimate, some have hundreds), that's 1000 * 10 * 5M = 50 billion rows. I guess it would be slow, and I have absolutely no idea how to handle a billion-row MySQL table. Is that possible? Any tips?

I'm also curious about NoSQL databases. I have never used Cassandra, but it seems to me that it would be perfect for this task, am I right? I can imagine a schema like this:

{
experiment_id_1: { <- that's a super column family?
    gene_id_1: { <- that's a column family?
        probe_id_1: { value_microarray_1, value_microarray_2, ... }, <- that's a super column?
        probe_id_2:{ value_microarray_1, value_microarray_2, ... },
        probe_id_3:{ value_microarray_1, value_microarray_2, ... },
        ...
    },
    gene_id_2:{
        probe_id_1:{ value_microarray_1, value_microarray_2, ... },
        probe_id_2:{ value_microarray_1, value_microarray_2, ... },
        probe_id_3:{ value_microarray_1, value_microarray_2, ... },
        ...
    }
},
experiment_id_2: {
    ...
}
...
}

Am I right? Would it fit the Cassandra model? Would it be efficient? What do you think, NoSQL gurus? :)
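
(For what it's worth, in newer Cassandra versions a super-column layout like the one above would normally be expressed as a CQL table instead; a rough sketch with illustrative names and types, not a verified design:)

    CREATE TABLE intensity (
        experiment_id int,
        gene_id       int,
        probe_id      int,
        microarray_id int,
        intensity     float,
        PRIMARY KEY ((experiment_id, gene_id), probe_id, microarray_id)
    );

    -- all probe intensities for one gene in one experiment sit in one partition:
    SELECT probe_id, microarray_id, intensity
    FROM intensity
    WHERE experiment_id = 1 AND gene_id = 1;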

Thanks.

asked Apr 05 '12 at 11:04 by Pierre




1 Answer

In this case I would also consider a relational database over a NoSQL one. With a few considerations you can check whether you will be able to handle the data:

  1. What is the expected size of the table? To get a rough idea, check the size of one set of data and calculate the overall size by multiplying it by the total expected number of datasets (a back-of-the-envelope estimate is sketched right after this list).
  2. Calculate the size of your indexes.
  3. Now check whether your server is able to hold those indexes in RAM, or better, the whole table in RAM.
  4. What is the ratio between DML and SELECT operations on this table?
  5. Make sure you have strategies in place for how you will deal with common tasks like backing up, optimizing and altering this kind of table.
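
For step 1, a rough payload estimate for the 50-billion-row table from the question can be done in MySQL itself (the byte counts are assumptions: four 4-byte integer columns plus one 4-byte float per row; InnoDB row and page overhead plus the indexes come on top):

    SELECT 50e9 * (4 * 4 + 4)                AS payload_bytes, -- ~1.0e12 bytes
           50e9 * (4 * 4 + 4) / POW(1024, 4) AS payload_tib;   -- roughly 0.9 TiB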

If I have to deal with this kind of situation, I usually generate some test data similar to what I expect in my table and play around with different server parameters. In this case I would also consider partitioning the table (for example, partitioning on experiment_id; a minimal sketch follows below). This splits the table into smaller subsets that can be handled within the boundaries of your existing hardware. You don't have to do this splitting yourself: MySQL can do it for you, and the table is still presented to the user as a single table, but the machine only has to deal with the partition where the rows for a given experiment_id are stored. This results in much faster I/O etc.
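
A minimal sketch of such a partitioning clause, applied to the intensity table from the question (the partition count of 64 is an arbitrary example; MySQL requires the partitioning column to be part of every unique key, which experiment_id already is here):

    ALTER TABLE intensity
        PARTITION BY HASH (experiment_id)
        PARTITIONS 64;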

I have already seen machines that easily handle tables with far more rows than you expect, but you have to plan such a setup carefully, and it usually requires a lot of testing/optimizing/redesigning before you can go into production with it. But it is always worth the effort, since it is quite an interesting thing to deal with.

(I made my first experiences in this field while dealing with EMBL data during my studies, and it became my passion ;))

answered Sep 25 '22 at 13:09 by Flo Doe