 

MongoDB model for uniqueness

Tags:

mongodb

model

Scenario:

10,000,000 records/day

Records: visitor, day of visit, cluster (where the visitor was seen), metadata

What we want to know with this information:

  1. Unique visitors on one or more clusters for a given range of dates.
  2. Unique visitors by day.
  3. Metadata grouped over a given range (platform, browser, etc.).

The model I settled on in order to query this information easily is:

    {
        VisitorId: 1,
        ClusterVisit: [
            {clusterId: 1, dates: [date1, date2]},
            {clusterId: 2, dates: [date1, date3]}
        ]
    }
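
For illustration, question 1 can be answered against this model with a distinct query. A minimal sketch in the mongo shell, assuming a collection named visits and date variables startDate/endDate (those names are mine, not from the post):

    // Unique visitors seen on cluster 1 within [startDate, endDate].
    // $elemMatch forces both conditions onto the same array element,
    // so the matching date must belong to that particular cluster.
    db.visits.distinct("VisitorId", {
        ClusterVisit: {$elemMatch: {clusterId: 1, dates: {$gte: startDate, $lte: endDate}}}
    }).length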

Indexes (a shell sketch follows the list):

  1. by VisitorId (to ensure uniqueness)
  2. by ClusterVisit.clusterId + ClusterVisit.dates (for searching)
  3. by VisitorId + ClusterVisit.clusterId (for updating)
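
In the mongo shell, those three indexes would look roughly like this (ensureIndex matches the MongoDB versions of the time; the collection name visits is an assumption):

    db.visits.ensureIndex({VisitorId: 1}, {unique: true})                          // 1. uniqueness
    db.visits.ensureIndex({"ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1})  // 2. searching
    db.visits.ensureIndex({VisitorId: 1, "ClusterVisit.clusterId": 1})             // 3. updating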

I also split groups of clusters into different collections in order to access the data more efficiently.

Importing: first we search for the combination of VisitorId and clusterId and $addToSet the date.

Second: if the first step matches nothing, we upsert with:

    {$addToSet: {ClusterVisit: {clusterId: 1, dates: [date1]}}}

and the {upsert: true} option set.

Between the first and second steps I cover the cases where the clusterId does not exist yet or the VisitorId does not exist at all (a shell sketch of both steps follows below).
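
A minimal sketch of the whole two-step import in a 2.6+ mongo shell, where update() returns a WriteResult (the collection name visits is an assumption; date1 stands for the visit date):

    // Step 1: the visitor already has an entry for this cluster;
    // add the date to that cluster's array via the positional operator.
    var res = db.visits.update(
        {VisitorId: 1, "ClusterVisit.clusterId": 1},
        {$addToSet: {"ClusterVisit.$.dates": date1}}
    );

    // Step 2: nothing matched, so upsert: this appends a new cluster
    // entry to an existing visitor, or creates the visitor document.
    if (res.nMatched === 0) {
        db.visits.update(
            {VisitorId: 1},
            {$addToSet: {ClusterVisit: {clusterId: 1, dates: [date1]}}},
            {upsert: true}
        );
    }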

Problems: updates/inserts/upserts become totally inefficient (nearly impossible) as the collection grows, I guess because the documents keep getting bigger as new dates are added. It is also difficult to maintain (mostly unsetting dates).

I have a collection with more than 50,000,000 documents that I can't grow any more. It manages only about 100 updates/sec.

I think the model I'm using is not the best for this volume of information. What do you think would be best to get more upserts/sec and query the information FAST, before I mess with sharding, which is going to take more time while I learn it and get confident with it?

I have an x1.large instance on AWS with RAID 10 over 10 disks.

Nicolas Alejo asked Jan 24 '26 20:01


1 Answer

Arrays are expensive on large collections: mapReduce, aggregate, ...

Try .explain(): MongoDB 'count()' is very slow. How do we refine/work around with it?
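
For example (collection name and filter are illustrative):

    db.visits.find({"ClusterVisit.clusterId": 1}).explain()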

Add explicit index hints: Simple MongoDB query very slow although index is set
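
For example, pinning the search index from the question (collection and variable names illustrative):

    db.visits.find({"ClusterVisit.clusterId": 1, "ClusterVisit.dates": {$gte: someDate}})
             .hint({"ClusterVisit.clusterId": 1, "ClusterVisit.dates": 1})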

A full heap?: Insert performance of node-mongodb-native

Running out of memory space for the collection: How to improve performance of update() and save() in MongoDB?

Special read clustering: http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/

Global write lock?: mongodb bad performance

Track performance through the slow query log: Track MongoDB performance?

Rotate your logs: Does logging output to an output file affect mongoDB performance?

Use the profiler: http://www.mongodb.org/display/DOCS/Database+Profiler
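
For example, to log operations slower than 100 ms (the threshold is just an example) and inspect the slowest ones:

    db.setProfilingLevel(1, 100)
    db.system.profile.find().sort({millis: -1}).limit(5).pretty()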

Move some collection caches to RAM: MongoDB preload documents into RAM for better performance

Some ideas about collection allocation size: MongoDB data schema performance

Use separate collections: MongoDB performance with growing data structure

A single query can only use one index (a compound one is better): Why is this mongodb query so slow?

A missing key?: Slow MongoDB query: can you explain why?

Maybe shards: MongoDB's performance on aggregation queries

More Stack Overflow links on improving performance: https://stackoverflow.com/a/7635093/602018

A good starting point for further education on sharding and replicas is: https://education.10gen.com/courses

42n4 answered Jan 27 '26 13:01