Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

report generation for million rows in mysql

Tags:

mysql

We are creating a site and for which we read the txt file( using PHP-mysql) from the server and save it to the mysql database ( using load in file query) . Its a temporary database and then this data is moved to the other database with some additional info like ids, codes etc.Temporary database is then emptied and will be dumped with another txt file.

the database with additional info is growing at a faster pace.

The text files are continuesly coming and the database is growing like 2 million records in a day. roughly on average there are about 15-20 files comes and at any time interval. it doesnt have any specific time that the files wont come.

we are doing some group by queries and scaling down the data in ( database with additional info) to about 47k from 7 million records and saving it in the other database for report generation . This data(47K) is for 4 days( report database) and for report purpose . so at the end of the month the main database with ids, codes might grow to 70 million records.

we have to generate the reports at the same time from the table on the daily basis and on weekly basis and also on monthly basis.

since the database is growing at the faster rate , how to handle the reports after 1 month eg if they want the report from jan to april eg..

how we are suppose to handle that scenario since the read/write are being done on the same table that is growing at the enormous rate.

like image 721
dell Avatar asked Nov 12 '22 04:11

dell


1 Answers

Without really knowing what is going on with the data, but it doesn't make much sense to me that you read from txt file into data base... massage it and add other link codes, then re-dump it to a txt file just to keep it off the database. I would leave it on the database.

At what rate (disk capacity too) are the records getting added?

What types of reports are being generated... Statistical? General aggregations?

If just general aggregations, I would keep a separate table on the database with whatever rolled-up numbers categories, classifications, whatever. Then reports could be run from that instead of re-importing all those text files you dumped out. Having a "summary" table like on an hourly basis (or even daily basis) could easily simplify the reporting process.

Without seeing some sampling of the data and expected reporting output, it would be hard to solidify any single approach.

---- Per comment feedback

having dealt with call-data-record (CDR) information in the past, there is a lot of junk out there too. The overall record of what I would consider critical would be rather limited. Also, knowing phone companies have just a few million calls per day and having to track them could overload your database disk capacity so some archiving process would make sense, but general roll-ups not as much. Even if you were tracking every originating number and how many times calling a destination number... or how many common destination numbers coming from different origination numbers could be rolled-up and limiting your overall reporting stats.

--- Ex: If the gov't tracking calls (no... they do that???) on people, they might want to know how many different people are calling person X and how many person X is calling too. The concern is in BOTH directions of communications. Once a pattern is identified and throwing alarms could they focus and retain records in a separate tracking table that is NOT dumped after the output text files are created... And no, I don't work for the gov't or any such agencies who might do this.

-- What to keep I can only offer suggestions on what TYPE of information are you trying to retain... do you care about the PBX system, trunk lines originated from, Extensions? I would start by retaining the critical elements you want first and stripping what might be useless information from an every record retention basis.

Origination Calling Number/Extension
Destination Number
International Country being called
Call Time Start
Call Duration

It's been over 10 yrs since I've worked with PBX systems and Call-Data-Records, but I know these should be common core to the data. Depending on how the CDR layout is, you might have to standardize the number being called. if one system stores as one full phone number, and another keeps separate for int'l calling prefix, you won't find a match of dialing

555-1234 vs +014 555-1234

I know the fake phone number isn't legit, but I'm sure you understand where I'm coming from.

like image 118
DRapp Avatar answered Nov 15 '22 06:11

DRapp