We are planning on using MongoDB to store large amounts of analytics data such as views and clicks. I'm unsure of the best way to structure the documents within MongoDB to aid querying and reduce database size.
We need to record actions against a pagename, a client and the type of action. Ideally we need stats that go down to the year/month/day/hour level; we don't need or care about views per second or minute. While this document structure looks OK, I'm aware that 100 visitors would generate 100 new documents.
{
"_id" : ObjectId( "4dabdef81a34961506040000" ),
"pagename" : "Hello",
"action" : "view",
"client" : "client-name",
"time" : Date( "Mon Apr 18 07:49:28 2011" )
}
Is there a best-practice way of doing this, either using $inc or Capped Collections?
Updated answer
Hacked together in the mongo shell:
use pagestats;
// a little helper function
var pagePerHour = function(pagename) {
var d = new Date(); // declared with var so it does not leak into the global scope
return {
page : pagename,
year: d.getUTCFullYear(),
month: d.getUTCMonth(), // 0-11
day : d.getUTCDate(), // 1-31
hour: d.getUTCHours() // 0-23
};
};
// a pageview happened
db.pagestats.update(
pagePerHour('Hello'),
{ $inc : { views : 1 }},
true ); //we want to upsert
// somebody tweeted our page twice!
db.pagestats.update(
pagePerHour('Hello'),
{ $inc : { tweets : 2 }},
true ); //we want to upsert
db.pagestats.find();
// { "_id" : ObjectId("4dafe88a02662f38b4a20193"),
// "year" : 2011, "day" : 21, "hour" : 8, "month" : 3,
// "page" : "Hello",
// "tweets" : 2, "views" : 1 }
// 24 hour summary 'Hello' on 2011-4-21
for (var i = 0; i < 24; i++) {
//careful: days (1-31), month (0-11) and hours (0-23)
var stats = db.pagestats.findOne({ page: 'Hello', year: 2011, month: 3, day : 21, hour : i});
if (stats) {
// views is missing if only other counters (e.g. tweets) were recorded in this hour
print(i + ': ' + (stats.views || 0) + ' views');
} else {
print(i + ': no hits');
}
}
Depending on which aspects you want to track, you might consider adding more collections (e.g. a collection for user-centric tracking). Hope that helps.
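The question also asks for stats per client and per action type. As a rough sketch (not part of the original answer), the helper above could be extended so the client goes into the upsert key and the action name becomes the counter field. The names hourBucket and incFor are made up for illustration, and the date is passed in as an optional argument only so the function can be checked without a database:

```javascript
// Hypothetical helper: builds the upsert key for one page/client/hour bucket.
// The date argument is optional and defaults to "now".
function hourBucket(pagename, client, date) {
  var d = date || new Date();
  return {
    page: pagename,
    client: client,
    year: d.getUTCFullYear(),
    month: d.getUTCMonth(), // 0-11
    day: d.getUTCDate(),    // 1-31
    hour: d.getUTCHours()   // 0-23
  };
}

// Hypothetical helper: builds the $inc modifier for an action name
// such as 'view' or 'click'. The counter field is named after the action.
function incFor(action, count) {
  var inc = {};
  inc[action] = count || 1;
  return { $inc: inc };
}

// In the mongo shell this would be used like the updates above:
// db.pagestats.update(hourBucket('Hello', 'client-name'), incFor('view'), true);
```

With this layout there is still at most one document per page, client and hour, no matter how many visitors arrive in that hour.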
See also
Blogpost about Analytics Data
I wouldn't worry too much about space. Mongo scales very well in that regard, and adding more space is reasonably cheap.
One thing to be aware of: if you keep updating a document and it grows in size, Mongo will eventually need to find a new place for it on disk. If a lot of documents are being updated and growing, Mongo will need to copy them around a lot, which can slow things down significantly. Of course, this all depends on how much traffic you're expecting.
Based on my experience, go with a simple document format where you don't need to update the documents. It might complicate your querying later on, but you can use map/reduce to get whatever information you want regardless of your document structure (map/reduce is very flexible; given enough experience you can do almost anything with it).
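To make the map/reduce idea concrete, here is a minimal sketch in plain JavaScript of what the map and reduce steps would compute over the raw event documents from the question. It runs on an in-memory array rather than calling MongoDB's own mapReduce command, and the function names (mapEvent, reduceCounts, rollup) and the '|'-joined key are assumptions made for illustration:

```javascript
// map step: derive a grouping key from one raw event document
// (field names follow the document in the question).
function mapEvent(ev) {
  var t = ev.time;
  var key = [
    ev.pagename, ev.client, ev.action,
    t.getUTCFullYear(), t.getUTCMonth(), t.getUTCDate(), t.getUTCHours()
  ].join('|');
  return { key: key, value: 1 };
}

// reduce step: sum all values emitted for one key
function reduceCounts(values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Group the mapped values by key, then reduce each group to a count.
function rollup(events) {
  var grouped = {};
  events.forEach(function (ev) {
    var kv = mapEvent(ev);
    (grouped[kv.key] = grouped[kv.key] || []).push(kv.value);
  });
  var out = {};
  Object.keys(grouped).forEach(function (k) {
    out[k] = reduceCounts(grouped[k]);
  });
  return out;
}
```

The trade-off compared with the $inc approach is that every hit stays available as its own document, so new groupings can be computed later, at the cost of storing one document per event.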