Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the best way of storing trend data?

I am currently building an application where I am importing statistical data for (currently) around 15,000 products. At current, if I was to maintain one database table for each day statistics from one source it would be increased by 15,000 rows of data (let's say 5-10 fields per row primarily float, int) per day. Obviously equating to over 5 million records per year into one table.

That doesn't concern me so much as the thought of bringing in data from other sources (and thus increasing the size the database by 5 million records for each new source).

Now the data is statistical / trending based data, and will have basically 1 write per day per record, and many reads. For purposes of on the fly reporting and graphing however I need fast access to subsets of the data based on rules (date ranges, value ranges, etc).

What my question is, is this the best way to store the data (MySQL InnoDb tables), or is there a better way to store and handle statistical/trend data?

Other options I have tossed around at this point: 1. Multiple databases (one per product), with separate tables for each data source within. (ie Database: ProductA, Table(s):Source_A, Source_B, Source_C) 2. One database, multiple tables (one for each product/data source) (ie Database: Products, Table(s): ProductA_SourceA, ProductA_SourceB, etc.) 3. All factual or specific product information in the database and all statistical data in csv, xml, json, (flat files) in separate directories.

So far, none of these options are very manageable, each has its pros and cons. I need a reasonable solution before I move into the alpha stage of development.

like image 399
Aaron Murray Avatar asked Apr 20 '11 02:04

Aaron Murray


2 Answers

You could try making use of a column based database. These kinds of databases are much better at analytical queries of the kind you're describing. There are several options:

http://en.wikipedia.org/wiki/Column-oriented_DBMS

We've had good experience with InfiniDB:

http://infinidb.org/

and Infobright looks good as well:

http://www.infobright.com/

Both InfiniDB and Infobright have free open source community editions, so I would recommend using these to get some benchmarks on the kinds of performance benefit you might get.

You might also want to look at partitioning your data to improve performance.

like image 64
srkiNZ84 Avatar answered Oct 06 '22 01:10

srkiNZ84


It's a little bit dependent upon what your data looks like, and the kind of aggregations/trends you're looking to run. Most relational databases work just fine for this sort of chronological data. Even with billions of records, proper indexing and partitioning can make quick work work of finding the records you need. DB's like Oracle, MySQL, SQL-Server fall within this category.

Lets say the products you work with are stocks, and for each stock you get a new price every day (a very realistic case). New exchanges, stocks, trade frequencies will grow this data exponentially pretty quickly. You could however partition the data by exchange. Or region.

Various Business Intelligence tools are also able to assist in, what effectively amounts to pre-aggregating data prior to retrieval. This is basically a Column-oriented database as was suggested. (Data Warehouses and OLAP structures can assist in massaging and aggregating data sets ahead of time).

Similar to the idea of data warehousing, if it's just a matter of the aggregations taking too long, you can work-off the aggregations overnight into a structure which is more quick to query from. In my previous example, you may only need to retrieve large chunks of data very infrequently, but more often some aggregation such as 52 week high. You can store the large amount of raw data in one format, and then every night have a job work off only what you need into a table which rather than thousands of data points per stock, now has 3 or 4.

If the trends you're tracking are really all over the place, or complex algorithms, a full fledged BI solution might be something to investigate so you can use pre-built analityic and data mining algorithms.

If the data is not very structured, you may have better luck with a NoSQL database like Hadoop or Mongo, although admittedly my knowledge of databases is more focused around relational formats.

like image 45
Xedni Avatar answered Oct 06 '22 00:10

Xedni