 

A huge data storage problem

I'm starting to design a new application that will be used by about 50,000 devices. Each device generates about 1,440 records a day, which means over 72 million records will be stored per day. These records keep coming in every minute, and I must be able to query this data from a Java (J2EE) application, so it needs to be fast to write, fast to read, and indexed to allow report generation. The devices only insert data; the J2EE application will need to read it occasionally. I'm now looking at software alternatives to support this kind of operation.

  • Putting this data in a single table would be catastrophic: after a year, the sheer volume of stored data would make it unusable.

  • I'm using Postgres, and database partitioning doesn't seem to be the answer, since I'd need to partition tables by month, or maybe take an even more granular approach, by day for example.

I was thinking of a solution using SQLite. Each device would have its own SQLite database, so the information would be granular enough for good maintenance and fast insertions and queries.

What do you think?

Asked by gmuller


2 Answers

  1. Record only changes in device position. Most of the time a device will not be moving: a car will be parked, a person will be sitting or sleeping, a phone will be on an unmoving person or on a charger, etc. This gives you an order of magnitude less data to store.

  2. You'll be generating at most about 1 TB a year (even without implementing point 1), which is not a very big amount of data. Spread over a year that works out to roughly 30 KB/s of sustained writes (about 830 inserts per second at a few tens of bytes per record), which a single SATA drive can handle easily.

  3. Even a simple, unpartitioned Postgres database on modest hardware should manage to handle this. The only problem could be when you need to query or back up the data; that can be resolved with a Hot Standby mirror using Streaming Replication, a new feature in the soon-to-be-released PostgreSQL 9.0. Just query against / back up the mirror; if it is busy, it will temporarily and automatically queue changes and catch up later.

  4. When you really need to partition, do it on, for example, device_id modulo 256 instead of on time. That way writes are spread across every partition; if you partition on time, only one partition is busy at any given moment and the others sit idle. Postgres supports partitioning this way very well. You can then also spread the load across several storage devices using tablespaces, which are also well supported in Postgres. A sketch of this layout follows below.
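A minimal sketch of that layout, assuming a Java/JDBC application and placeholder connection, table, and column names (none of these come from the question). It uses the table-inheritance mechanism that was the standard approach around PostgreSQL 9.0; newer releases (10+) can express the same idea declaratively with PARTITION BY HASH.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PartitionSetup {
    // The modulus is a tuning choice; the answer suggests something like 256.
    static final int PARTITIONS = 16;

    public static void main(String[] args) throws Exception {
        // Placeholder connection details.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/telemetry", "app", "secret");
             Statement st = con.createStatement()) {

            // Parent table; devices only ever INSERT into it.
            st.execute("CREATE TABLE IF NOT EXISTS readings ("
                     + "  device_id   integer     NOT NULL,"
                     + "  recorded_at timestamptz NOT NULL,"
                     + "  payload     text        NOT NULL)");

            // One child table per (device_id % PARTITIONS) bucket, so inserts
            // are spread across all partitions instead of hammering the single
            // "current" partition that a time-based scheme would produce.
            for (int i = 0; i < PARTITIONS; i++) {
                st.execute(String.format(
                    "CREATE TABLE IF NOT EXISTS readings_p%1$d ("
                  + "  CHECK (device_id %% %2$d = %1$d)"
                  + ") INHERITS (readings)", i, PARTITIONS));
                st.execute(String.format(
                    "CREATE INDEX IF NOT EXISTS readings_p%1$d_dev_time_idx"
                  + "  ON readings_p%1$d (device_id, recorded_at)", i));
            }
            // Note: with inheritance-based partitioning, the application (or an
            // INSERT trigger) must target the correct child table, i.e.
            // readings_p(device_id % PARTITIONS). With the default
            // constraint_exclusion setting, queries that filter on device_id
            // let the planner skip irrelevant children.
        }
    }
}
```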

Answered by Tometzky


Time-interval partitioning is a very good solution, even if you have to roll your own. Maintaining separate connections to 50,000 SQLite databases is much less practical than a single Postgres database, even for millions of inserts a day.

Depending on the kind of queries you need to run against your dataset, you might consider partitioning your remote devices across several servers and then querying those servers to write aggregate data to a backend server.
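As a rough illustration of that fan-in idea, the sketch below pulls per-device daily counts from a couple of hypothetical shard servers and writes the aggregates to a reporting database. All URLs, table names, and column names are assumptions, not part of the answer.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;

public class AggregateRollup {
    // Hypothetical shard servers, each holding a subset of the devices.
    static final List<String> SHARDS = List.of(
        "jdbc:postgresql://shard1/telemetry",
        "jdbc:postgresql://shard2/telemetry");

    public static void main(String[] args) throws Exception {
        try (Connection backend = DriverManager.getConnection(
                 "jdbc:postgresql://backend/reports", "app", "secret");
             PreparedStatement insert = backend.prepareStatement(
                 "INSERT INTO daily_counts (device_id, day, readings) VALUES (?, ?, ?)")) {

            for (String url : SHARDS) {
                try (Connection shard = DriverManager.getConnection(url, "app", "secret");
                     Statement st = shard.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT device_id, date_trunc('day', recorded_at) AS day, count(*)"
                       + "  FROM readings"
                       + " WHERE recorded_at >= now() - interval '1 day'"
                       + " GROUP BY 1, 2")) {
                    // Write one aggregate row per device per day to the backend,
                    // keeping the reporting database small and index-friendly.
                    while (rs.next()) {
                        insert.setInt(1, rs.getInt(1));
                        insert.setTimestamp(2, rs.getTimestamp(2));
                        insert.setLong(3, rs.getLong(3));
                        insert.executeUpdate();
                    }
                }
            }
        }
    }
}
```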

The keys to high-volume tables are: minimize the amount of data you write and the number of indexes that have to be updated; don't do UPDATEs or DELETEs, only INSERTs; and use partitioning for data that you will delete in the future (DROP TABLE is much faster than DELETE FROM table!).
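To make the DROP TABLE versus DELETE point concrete, here is a small sketch of a retention job that drops an expired daily partition outright. The partition naming scheme and connection details are purely illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RetentionJob {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; assumes daily child tables
        // named readings_yYYYYMMDD (naming is illustrative).
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/telemetry", "app", "secret");
             Statement st = con.createStatement()) {

            // Expiring a whole day of data is a cheap metadata operation:
            st.execute("DROP TABLE IF EXISTS readings_y20100115");

            // whereas the row-by-row alternative would touch (and later
            // require vacuuming of) every expired row:
            //   DELETE FROM readings WHERE recorded_at < '2010-01-16';
        }
    }
}
```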

Table design and query optimization become very database-specific as you start to push the database engine. Consider hiring a Postgres expert to at least consult on your design.

Answered by Seamus Campbell