I am currently logging every "failure" on my site (login/signup/etc) to a database so I can monitor what is giving my users a hard time - or which ips/users are doing suspicious things.
However, I find that I only really need the data for about a week or so since I check it every day and, at most, need to see the activity from the past week.
I was thinking that perhaps I should try to save some of the load my database is taking from all this logging and place the data in something like memcached or couchdb. However, I'm not sure how I could query the data into result sets.
How could you use a key-value store or document-database to monitor logs and track relations between activity? And is it even worth it to add another data store to the server or just keep the database from handling it? I mention memcached and couchdb because both can have very light RAM usage if needed (unlike mongodb and redis).
Let me give an example. IP 0.0.0.0 failed login 37 times in 3 hours (each recorded) it also failed to reset a password for a valid email 84 times in 2 hours. Thanks to my logs I can now research (and block) this bot. On the other hand, I see that out of the 5827 users registered - there were 2188 failed register attempts. This tells me that there is something wrong with my signup form causing many of the people to fail the form at least once.
Again, the bounty is for a working example of using key-value or document store to log data.
Just write to a log file and analyse it offline. Logging is a solved problem, and writing a line of text to a file on disk is about as cheap, IO and CPU-wise, as you can possibly get. Log rotation is also a solved problem, and there's really no point in reinventing that wheel.
Once the log data is on disk, you can copy it off to another machine for parsing and analysis using whatever toolkit you want, and if you want to use a document store, that's the place to introduce it. There's no need to burden your front-facing production machines with that job.
Key-value stores or document-based databases are not a panacea. If you want to play with them just for fun, then it's fine, but if you want to do this to save some of the load my database, I highly recommend not to waste your time. Let me explain.
First, you must realize that these data structures recently became popular because of need for scalability for ultra-large sites (LinkedIn, Facebook, etc.). And more importantly, they gave that portion of scalability at a price of convenience.
Think of these new generation data stores as of stripped down databases which don't have inter-table relationships and SQL layer. So writes become cheap as there's no need to worry about dependant data. But reads can then become expensive (if you don't have an index) as you have to deal with O(n) complexity. This is OK for cases where key id is always known or for post-processing jobs where response time isn't a big deal. Or, you can have fast searches with index on a flat document, but don't expect foreign keys to be handled automatically.
If you were to log data into a k-v store, you could possibly solve your query problem by logging entire record into a k-v store and logging keys (ids) for "failure" cases separately (e.g. could be stored under special key). Afterwards, you could locate the offending records in O(1) time. Need to quickly lookup different cases (failed to reset password, failed to register)? No problem, just add another "special" key and reindex all existing data :) You have been warned about the loss of convenience!
If you were to log data into a document store, you could only benefit if your log records are flat (de-normalized). Otherwise I don't see how could you store the data in them it the first place. You could then create indices based on event type and query by it. I however don't see any big difference/improvement from what you have now.
But think about it. You will probably spend weeks (if not months) rewriting, debugging and testing existing logging code. You will have to define different backup strategies. You will have pain explaining that to your sysadmins, bosses, etc. Or you can buy a few hundred $ worth of SSD disk and achieve the same, if not better, results.
So, if I understand you correctly:
A couple things about non-relational solutions and document-oriented stores in particular: 1. They don't require you to be Facebook or Twitter. Setup for both MongoDB and CouchDB does not have to be an enterprise undertaking. 2. They are well-suited to storing log and event data. 3. Both CouchDB and MongoDB will take advantage of as much memory as is available to cache their indexes. 4. MongoDB offers a "capped" collection, which sets size limits on the data stored and then rolls of data rows/messages as they age out. This seems particularly suited to your needs, were you to implement MongoDB, as it doesn't require you to continually run heavy deletes against your relational database. 5. The query interface is materially different from the SQL you are used to. Both can take JSON-based query documents and bring back results. MongoDB's function library is easier for a relational developer to pick up, IMHO.
That said, here's the rub: 1. If you aren't going to set it up on a different machine, you won't solve the load issue. Non-relational stores are not as efficient with disk or memory as your MySQL instance is. 2. Both store data in JSON. If your logging component doesn't speak JSON, you'll need to code that. 3. If you rely on regex, Couch won't do it. Mongo will.
Mindas is right when he says that non-relational stores achieve their scale by ripping out the foundational aspects of relational stores: ACID transactions, strongly-typed data, well-defined structures, optimized join relations, efficient data storage.
That said, things like log records that have a finite lifespan, variant content and a flat structure are ideal for a document store and don't require as much infrastructure. I've spent literally decades building relational structures that work well for 90% of a system and shoe-horning the rest in through a series of work-arounds.
For me, non-relational stores offer the chance to keep flat, schema-variant data in a more natural form.
I hope this helps you find a path that works for you.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With