PHP/MYSQL - Pushing it to the limit?

I've been coding PHP for a while now and have a pretty firm grip on it; MySQL, well, let's just say I can make it work.

I'd like to make a stats script to track the stats of other websites, similar to the obvious ones: StatCounter, Google Analytics, Mint, etc.

I would, of course, like to code this properly, and I don't see MySQL liking 20,000,000 to 80,000,000 inserts per day (roughly 925 inserts per second at the high end).

I've been doing some research, and it looks like I should store each visit ("entry") in a CSV or some other kind of flat file and then import the data I need from it.

Am I on the right track here? I just need a push in the right direction: a way to inhale 1,000 pseudo "MySQL" inserts per second, and the proper way of doing it.

Example Insert: IP, time(), http_referer, etc.

I need to collect this data for the day, and then at the end of the day (or at certain intervals) update ONE row in the database with, for example, how many extra unique hits we got. I know how to do that part, of course; I'm just trying to give a visualization since I'm horrible at explaining things.
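To illustrate what I mean, the end-of-day step would look something like the sketch below (the table and column names are just made-up placeholders, not a schema I've settled on):

```php
<?php
// Hypothetical end-of-day rollup: count the day's distinct IPs from the raw
// hit log and fold the total into ONE summary row per site. All table and
// column names here are placeholders.
$db = new mysqli('localhost', 'user', 'pass', 'stats');

$siteId = 42;               // placeholder site id
$day    = date('Y-m-d');

$stmt = $db->prepare(
    "SELECT COUNT(DISTINCT ip) FROM raw_hits
     WHERE site_id = ? AND DATE(hit_time) = ?"
);
$stmt->bind_param('is', $siteId, $day);
$stmt->execute();
$stmt->bind_result($uniqueHits);
$stmt->fetch();
$stmt->close();

// One write per site per day instead of millions of per-hit writes.
$upd = $db->prepare(
    "INSERT INTO daily_summary (site_id, day, unique_hits)
     VALUES (?, ?, ?)
     ON DUPLICATE KEY UPDATE unique_hits = VALUES(unique_hits)"
);
$upd->bind_param('isi', $siteId, $day, $uniqueHits);
$upd->execute();
```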

If anyone can help me, I would be more than willing to return the favor; I'm a good coder.

Mickey asked Apr 20 '09

2 Answers

We tackled this at the place I've been working for the last year or so, over the summer. We didn't require much granularity in the information, so what worked very well for us was coalescing data by different time periods. For example, we'd keep a single day's worth of real-time stats; after that, the data would be pushed into daily sums and then off into a monthly table.
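A rough sketch of what that coalescing step looks like, with hypothetical table and column names (not our actual schema):

```php
<?php
// Sketch of the coalescing step: fold yesterday's real-time rows into a
// daily-sums table, then prune them. Table and column names are hypothetical.
$db = new mysqli('localhost', 'user', 'pass', 'stats');
$yesterday = date('Y-m-d', strtotime('-1 day'));

// One summary row per site for the day.
$db->query(
    "INSERT INTO daily_stats (site_id, day, hits, unique_ips)
     SELECT site_id, DATE(hit_time), COUNT(*), COUNT(DISTINCT ip)
     FROM realtime_hits
     WHERE DATE(hit_time) = '$yesterday'
     GROUP BY site_id, DATE(hit_time)"
);

// Drop the detailed rows now that they're summarised; this is where the
// granularity goes.
$db->query("DELETE FROM realtime_hits WHERE DATE(hit_time) = '$yesterday'");
```

The monthly rollup is the same pattern again, just aggregating daily_stats instead of the raw hits.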

This obviously has some huge drawbacks, namely a loss of granularity. We considered a lot of different approaches at the time. For example, as you said, CSV or some similar format could potentially serve as a way to handle a month of data at a time. The big problem, however, is the inserts.

Start by setting out a sample schema covering EXACTLY what information you need to keep; in doing so, you'll guide yourself (through revisions) to what will work for you.
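Something along these lines could be a starting point; the columns below are just an assumption based on the IP / time() / http_referer example in the question:

```php
<?php
// A possible starting schema, sketched from the fields mentioned in the
// question. Revise the columns as your requirements firm up.
$db = new mysqli('localhost', 'user', 'pass', 'stats');

$db->query(
    "CREATE TABLE IF NOT EXISTS realtime_hits (
         id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
         site_id  INT UNSIGNED    NOT NULL,
         ip       VARCHAR(45)     NOT NULL,   -- long enough for IPv6 text
         hit_time DATETIME        NOT NULL,
         referer  VARCHAR(255)    NULL,
         PRIMARY KEY (id),
         KEY idx_site_time (site_id, hit_time)
     ) ENGINE=MyISAM"
);
```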

Another note on the vast number of inserts: we toyed with the idea of dumping real-time statistics into a little daemon that would buffer up to an hour's worth of data, then, outside the real-time path, inject it into the database before the next hour was up. Just a thought.
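A bare-bones version of that buffering idea, assuming a flat spool file and MySQL's LOAD DATA INFILE for the hourly flush (the paths, table, and column names are made up for illustration):

```php
<?php
// Bare-bones buffering sketch: append each hit to a local spool file (cheap),
// then bulk-load the file once an hour. Paths and names are made up.

// Hot path: called on every hit; touches only the local filesystem.
function spool_hit($siteId, $ip, $referer)
{
    $line = implode("\t", [$siteId, $ip, date('Y-m-d H:i:s'), $referer]) . "\n";
    file_put_contents('/var/spool/stats/current.tsv', $line, FILE_APPEND | LOCK_EX);
}

// Cron / daemon path: once an hour, rotate the spool and bulk-load it.
function flush_spool(mysqli $db)
{
    $spool = '/var/spool/stats/current.tsv';
    if (!file_exists($spool)) {
        return;                          // nothing was spooled this hour
    }
    $batch = '/var/spool/stats/' . date('YmdH') . '.tsv';
    rename($spool, $batch);

    $db->query(
        "LOAD DATA INFILE '" . $db->real_escape_string($batch) . "'
         INTO TABLE realtime_hits
         FIELDS TERMINATED BY '\\t'
         (site_id, ip, hit_time, referer)"
    );
}
```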

Anthony answered Oct 09 '22


For the kind of activity you're looking at, you need to look at the problem from a new point of view: decoupling. That is, you need to figure out how to decouple the data-recording steps so that delays and problems don't propagate back up the line.

You have the right idea in logging hits to a database table, insofar as that guarantees in-order, non-contended access. But it comes at a price, one part of which is that the database completes the INSERT before getting back to you. Thus the recording of the hit is coupled with the invocation of the hit. Any delay in recording the hit will slow the invocation.

MySQL offers a way to decouple that; it's called INSERT DELAYED. In effect, you tell the database "insert this row, but I can't stick around while you do it" and the database says "okay, I got your row, I'll insert it when I have a minute". It is conceivable that this reduces locking issues because it lets one thread inside MySQL do the insert, rather than whichever connection you happen to be on. Unfortunately, it only works with MyISAM tables.
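From PHP it is just the usual query with the DELAYED keyword added; the table and column names below are assumptions for illustration:

```php
<?php
// INSERT DELAYED returns as soon as MySQL has queued the row; the actual
// write happens later in a server-side handler thread. MyISAM only.
// Table and column names here are assumptions for illustration.
$db = new mysqli('localhost', 'user', 'pass', 'stats');

$ip      = $db->real_escape_string($_SERVER['REMOTE_ADDR']);
$referer = $db->real_escape_string($_SERVER['HTTP_REFERER'] ?? '');

$db->query(
    "INSERT DELAYED INTO realtime_hits (site_id, ip, hit_time, referer)
     VALUES (42, '$ip', NOW(), '$referer')"
);   // returns without waiting for the row to reach the table
```

Worth noting: more recent MySQL versions have deprecated and then removed INSERT DELAYED, so check what your server supports.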

Another solution, which is a more general solution to the problem, is to have a logging daemon that accepts your logging information and just en-queues it to wherever it has to go. The trick to making this fast is the en-queueing step. This is the sort of solution syslogd would provide.
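If you take that route, PHP's built-in syslog functions are one quick way to hand the hit off and return immediately; the record format below is just an example:

```php
<?php
// Hand the hit off to syslogd and return immediately. A separate consumer
// (or a purpose-built daemon in place of syslogd) reads the log and does the
// database writes on its own schedule. The record format is just an example.
openlog('sitestats', LOG_NDELAY, LOG_LOCAL0);

$record = implode("\t", [
    $_SERVER['REMOTE_ADDR'],
    time(),
    $_SERVER['HTTP_REFERER'] ?? '',
]);

syslog(LOG_INFO, $record);   // the en-queueing step is all the request pays for
closelog();
```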

staticsan answered Oct 10 '22