
Database design advice needed

I'm a lone developer for a telecoms company, and am after some database design advice from anyone with a bit of time to answer.

I insert ~2 million rows into one table each day; these tables are then archived and compressed on a monthly basis. Each monthly table contains ~15,000,000 rows, although this is increasing month on month.

For every insert above, I combine the data from rows which belong together and write the result to another "correlated" table. This table is currently not being archived, as I need to make sure I never miss an update to it (hope that makes sense), although in general this information should remain fairly static after a couple of days of processing.

All of the above is working perfectly. However, my company now wishes to run some stats against this data, and these tables are getting too large to return results in what would be deemed a reasonable time, even with the appropriate indexes set.

So after all the above, my question is quite simple. Should I write a script which groups the data from my correlated table into smaller tables? Or should I store the queries' result sets in something like memcache? I'm already using MySQL's cache, but because I have limited control over how long the data is stored, it's not working ideally.

The main advantages I can see of using something like memcache:

  • No blocking on my correlated table after the query has been cached.
  • Greater flexibility in sharing the collected data between the backend collector and the front-end processor (i.e. custom reports could be written in the backend and their results stored in the cache under a key, which then gets shared with anyone who wants to see the data for that report).
  • Redundancy and scalability if we start sharing this data with a large number of customers.

The main disadvantages I can see of using something like memcache:

  • Data is not persistent if the machine is rebooted or the cache is flushed.
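To make the caching option concrete, here is a minimal sketch of the cache-aside pattern with a per-key expiry. It is only an illustration: `TTLCache` is a hypothetical in-process stand-in for memcache (not the memcache client API), and `cached_report`, the key names and `run_query` are assumptions. Because a miss falls back to re-running the query, a reboot or flush costs one slow query rather than lost data.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Hypothetical in-process stand-in for memcache: keys expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries behave like a miss
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._store[key] = (self._clock() + ttl, value)

    def flush(self) -> None:
        self._store.clear()


def cached_report(cache: TTLCache, key: str, ttl: float,
                  run_query: Callable[[], Any]) -> Any:
    """Cache-aside: serve from the cache, or run the query once and store it."""
    result = cache.get(key)
    if result is None:
        result = run_query()  # the slow query against the correlated table
        cache.set(key, result, ttl)
    return result
```

Unlike the database's own cache, the caller chooses how long each report stays valid through `ttl`.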

The main advantages of using MySQL:

  • Persistent data.
  • Fewer code changes (although adding something like memcache is trivial anyway).

The main disadvantages of using MySQL:

  • Have to define table templates every time I want to store a new set of grouped data.
  • Have to write a program which loops through the correlated data and fills these new tables.
  • Will potentially still get slower as the data continues to grow.

Apologies for quite a long question. It's helped me to write down these thoughts here anyway, and any advice/help/experience with dealing with this sort of problem would be greatly appreciated.

Many thanks.

Alan

Alan Hollis asked May 27 '10 08:05


3 Answers

(Another answer from me, different enough that I'll post it separately)

Two questions:

What sort of stats does your company want to generate?
and
After rows are inserted into the database, are they ever changed?

If data doesn't change after insert, then you may be able to build up a separate 'stats' table that you amend/update as new rows are inserted, or maybe soon after new rows are inserted.

e.g. things like:

  • When a new row is inserted that's relevant to stat 'B', go and increment a number in another table for stat 'B', minute 'Y'
    or
  • Every hour, run a small query on rows that have been inserted in the last hour, that generates the stats for that hour and stores them separately
    or
  • As above, but each minute, etc.

It's hard to be any more specific without knowing the details, but depending on the stats you're after, these kinds of approaches may help.
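As a rough illustration of the "every hour" option, the sketch below rolls one hour of rows up into a separate stats table. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the table and column names (`events`, `hourly_stats`, `stat`, `created_at`) are made up for the example. In MySQL the `INSERT OR REPLACE` would become something like `REPLACE INTO` or `INSERT ... ON DUPLICATE KEY UPDATE`.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the hypothetical raw table and its hourly summary table."""
    conn.executescript("""
        CREATE TABLE events (
            id INTEGER PRIMARY KEY,
            stat TEXT NOT NULL,
            created_at INTEGER NOT NULL  -- Unix time in seconds
        );
        CREATE TABLE hourly_stats (
            stat TEXT NOT NULL,
            hour_start INTEGER NOT NULL,
            row_count INTEGER NOT NULL,
            PRIMARY KEY (stat, hour_start)
        );
    """)


def roll_up_hour(conn: sqlite3.Connection, hour_start: int) -> None:
    """Count the rows of one hour per stat and store the totals separately."""
    hour_end = hour_start + 3600
    with conn:  # commit on success, roll back on error
        conn.execute(
            """
            INSERT OR REPLACE INTO hourly_stats (stat, hour_start, row_count)
            SELECT stat, ?, COUNT(*)
            FROM events
            WHERE created_at >= ? AND created_at < ?
            GROUP BY stat
            """,
            (hour_start, hour_start, hour_end),
        )
```

Reports then read from `hourly_stats` instead of scanning the large table, and re-running the roll-up for the same hour overwrites the totals rather than double-counting them.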

codeulike answered Nov 18 '22 00:11


If you want to do some analysis of static data from a few days back, you should perhaps consider using something like an OLAP system.

Basically, this type of system stores intermediate stats in its own format so that it can run quick sum(), avg(), count()... on large tables.

I think your question is a perfect example of the situation where it's used, but perhaps I think so just because it's my job. =)
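The idea of "intermediate stats" can be sketched in a few lines. This is not how any particular OLAP product stores its data, just the general principle under the assumption that each chunk of data (a day, say) keeps its count and sum. Totals and averages over any range then come from merging the small partial results rather than rescanning the rows. The names `Partial`, `summarize` and `combine` are made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Partial:
    """Pre-computed aggregate for one chunk of rows (for example one day)."""
    count: int
    total: float

    def merge(self, other: "Partial") -> "Partial":
        return Partial(self.count + other.count, self.total + other.total)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def summarize(values: Iterable[float]) -> Partial:
    """Scan the raw values once and keep only the count and the sum."""
    values = list(values)
    return Partial(len(values), sum(values))


def combine(partials: Iterable[Partial]) -> Partial:
    """Merge stored partials; no raw rows are touched at query time."""
    result = Partial(0, 0.0)
    for partial in partials:
        result = result.merge(partial)
    return result
```

Storing the count and sum rather than the average is deliberate: averaging per-chunk averages gives the wrong answer when chunks differ in size, while merged counts and sums stay exact.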

Take a look.

Syprien answered Nov 17 '22 23:11


I work in a company in a similar situation, with millions of inserts monthly.

We adopted the strategy of summarizing the data in smaller tables, grouped by certain fields.

In our case, when an insert is performed, it triggers a function which classifies the inserted tuple and increments the summary tables.
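A trigger-maintained summary table might look roughly like the sketch below. It is an assumption-laden illustration, not the setup described above: it runs on Python's built-in sqlite3, and the `calls` and `calls_by_region` tables are invented for the example. Trigger syntax differs between databases, so treat the SQL as a pattern rather than something to paste into MySQL.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical schema: a raw table plus a summary grouped by one field.
SCHEMA = """
CREATE TABLE calls (
    id INTEGER PRIMARY KEY,
    region TEXT NOT NULL,
    duration_s INTEGER NOT NULL
);
CREATE TABLE calls_by_region (
    region TEXT PRIMARY KEY,
    call_count INTEGER NOT NULL DEFAULT 0,
    total_duration_s INTEGER NOT NULL DEFAULT 0
);
CREATE TRIGGER calls_after_insert AFTER INSERT ON calls
BEGIN
    -- Make sure a summary row exists for this group, then increment it.
    INSERT OR IGNORE INTO calls_by_region (region) VALUES (NEW.region);
    UPDATE calls_by_region
    SET call_count = call_count + 1,
        total_duration_s = total_duration_s + NEW.duration_s
    WHERE region = NEW.region;
END;
"""


def open_db() -> sqlite3.Connection:
    """Open an in-memory database with the summary trigger installed."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def summary(conn: sqlite3.Connection) -> List[Tuple[str, int, int]]:
    """Read the small summary table instead of scanning the raw rows."""
    return conn.execute(
        "SELECT region, call_count, total_duration_s "
        "FROM calls_by_region ORDER BY region"
    ).fetchall()
```

The trade-off is a little extra work on every insert in exchange for stats queries that only ever touch the small table.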

From time to time, we move the oldest rows to a backup table, reducing the growth of the main table.
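Moving old rows out could be done along these lines. Again this is a sketch on sqlite3 with invented names (`records`, `records_backup`, `created_at`); the point is that the copy and the delete happen in one transaction, so a failure part-way through cannot lose or duplicate rows.

```python
import sqlite3


def create_tables(conn: sqlite3.Connection) -> None:
    """Create a hypothetical main table and a backup table with the same columns."""
    conn.executescript("""
        CREATE TABLE records (
            id INTEGER PRIMARY KEY,
            payload TEXT NOT NULL,
            created_at INTEGER NOT NULL  -- Unix time in seconds
        );
        CREATE TABLE records_backup (
            id INTEGER PRIMARY KEY,
            payload TEXT NOT NULL,
            created_at INTEGER NOT NULL
        );
    """)


def archive_older_than(conn: sqlite3.Connection, cutoff: int) -> int:
    """Move rows created before `cutoff` to the backup table; return how many."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO records_backup SELECT * FROM records WHERE created_at < ?",
            (cutoff,),
        )
        cursor = conn.execute(
            "DELETE FROM records WHERE created_at < ?", (cutoff,)
        )
    return cursor.rowcount
```

On a table this size it would probably be wise to run this in batches during quiet periods, since a single large delete can hold locks for a long time.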

pcent answered Nov 17 '22 23:11