Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MongoDB Approaches for storing large amounts of metrics / analytics data

We are planning on using MongoDB to store large amounts of analytics data such as views and clicks. I'm unsure on the best way to structure the documents within MongoDB to aid querying and reduce database size.

We need to record actions agains a pagename, client and the type of action. Ideally we need stats which go down the the year/month/day/hour level, we don't need or care about views per second or minute. While this document structure looks ok, I'm aware 100 vistors would generate a 100 new documents.

{ 
  "_id" : ObjectId( "4dabdef81a34961506040000" ),
  "pagename" : "Hello",
  "action" : "view",
  "client" : "client-name",
  "time" : Date( "Mon Apr 18 07:49:28 2011" )
}

Is there best practice way of doing this, either using $inc or Capped Collections?

like image 670
Tom Avatar asked Apr 19 '11 06:04

Tom


People also ask

How does MongoDB store large data?

In MongoDB, use GridFS for storing files larger than 16 MB. In some situations, storing large files may be more efficient in a MongoDB database than on a system-level filesystem. If your filesystem limits the number of files in a directory, you can use GridFS to store as many files as needed.

Is MongoDB good for large amounts of data?

It can process large amounts of real-time data very quickly because of in-memory calculations. MongoDB: MongoDB is a NoSQL database. It has a flexible schema. MongoDB stores huge amounts of data in a naturally traversable format, making it a good choice to store, query, and analyze big data.

Can MongoDB be used for analytics?

Operational and analytical workloads in a single platformMongoDB brings both operational (or transactional) and analytical workload types together, while also providing workload isolation and optimization controls for price and performance. Easily build analytically-driven applications in a single, scalable platform.

Can MongoDB handle millions of data?

Working with MongoDB and ElasticSearch is an accurate decision to process millions of records in real-time. These structures and concepts could be applied to larger datasets and will work extremely well too.


2 Answers

Updated answer

Hacked together in the mongo shell:

use pagestats;

// a little helper function
var pagePerHour = function(pagename) {
    d = new Date();
    return {
        page : pagename,
        year: d.getUTCFullYear(),
        month: d.getUTCMonth(),
        day : d.getUTCDate(),
        hour: d.getUTCHours(),
    }
}

// a pageview happened
db.pagestats.update(
    pagePerHour('Hello'),
    { $inc : { views : 1 }},
    true ); //we want to upsert

// somebody tweeted our page twice!
db.pagestats.update(
    pagePerHour('Hello'),
    { $inc : { tweets : 2 }},
    true ); //we want to upsert

db.pagestats.find();
// { "_id" : ObjectId("4dafe88a02662f38b4a20193"),
//   "year" : 2011, "day" : 21, "hour" : 8, "month" : 3,
//   "page" : "Hello",
//   "tweets" : 2, "views" : 1 }

// 24 hour summary 'Hello' on 2011-4-21
for(i = 0; i < 24; i++) {
    //careful: days (1-31), month (0-11) and hours (0-23)
    stats = db.pagestats.findOne({ page: 'Hello', year: 2011, month: 3, day : 21, hour : i})
    if(stats) {
        print(i + ': ' + stats.views + ' views')
    } else {
        print(i + ': no hits')
    };
}

Depending on which aspects you want to track you might consider adding more collections (e.g. a collection for user centric tracking). Hope that helps.

See also

Blogpost about Analytics Data

like image 88
Matt Avatar answered Sep 22 '22 15:09

Matt


I wouldn't worry too much about space, Mongo can scale pretty much infinitely in that regard, adding more space would be reasonably cheap.

One thing to be aware of is the fact that if you keep updating a document its size will grow, which means Mongo will eventually need to find a new place for it in the index. If you have a lot of documents being updated and increasing in size Mongo will need to copy these documents around a lot, this can slow stuff down significantly. Of course this all depends on how much traffic you're expecting.

Based on my experience, go with a simple document format where you don't need to update the documents, it might complicate your querying later on, but you can use map/reduce to get whatever information you want regardless of your document structure (map reduce is very flexible given enough experience you can do anything).

like image 36
skorks Avatar answered Sep 19 '22 15:09

skorks