So I have a super simple database in MongoDB with a few collections:
> show collections
Aggregates <-- count: 92
Users <-- count: 68222
Pages <-- count: 1728288847, about 1.1TB
system.indexes
The Aggregates collection is an aggregate of the Pages collection, and each document looks like this:
> db.Aggregates.findOne()
{
"_id" : ObjectId("50f237126ba71610eab3aaa5"),
"daily_total_pages" : 16929799,
"day" : 21,
"month" : 9,
"year" : 2011
}
Very simple. However, let's try to get the total page loads by adding all 92 days' daily page loads together:
>>> def get_total():
... start = datetime.now()
... print sum([x['daily_total_pages'] for x in c.Aggregates.find()])
... end = datetime.now()
... print (end-start).seconds
...
>>> get_total()
1728288847
43
43 seconds?!??!??!?!
Those 92 aggregate results are tiny! I might as well just store them in a text file, this is crazy.
Or are they tiny? How big are they on disk according to mongo?
> db.Aggregates.stats()
{
"ns" : "c.AggregateResults",
"count" : 92,
"size" : 460250104,
"avgObjSize" : 5002718.521739131,
"storageSize" : 729464832,
"numExtents" : 7,
"nindexes" : 2,
"lastExtentSize" : 355647488,
"paddingFactor" : 1.0690000000000066,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 16352,
"indexSizes" : {
"_id_" : 8176,
"date_1" : 8176
},
"ok" : 1
}
438 megabytes total for those tiny daily numbers? Each one is approximately 280 bytes, so together they should be 25-30KB at most. So the storage is huge and the query is super slow. Is it possible it could be fragmented on disk? I created the aggregates after inserting the documents into the full Pages collection.
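The mismatch is visible from the arithmetic alone. A quick back-of-the-envelope check (just a sketch, plugging in the numbers from db.Aggregates.stats() above):

```python
count = 92
avg_obj_size = 5002718.52   # avgObjSize reported by db.Aggregates.stats()
hand_estimate = 280         # rough per-document size guessed from the fields

tiny_total = count * hand_estimate    # what 92 genuinely small docs should weigh
actual_total = count * avg_obj_size   # what mongo actually reports

print(tiny_total)                     # ~26 KB
print(round(actual_total / 1e6))      # ~460 MB, matching "size" in stats()
```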
Anyone have any insights into this madness? :O
Edit: Solved by Jared's more specific find() query. The video Sammaye provided below also gives some very interesting storage insights.
Edit 2: It turns out that sys.getsizeof() is a really unreliable way to measure a document's size, since it doesn't recurse down any nested structures. So my docs were actually quite large, and the best fix was to use a more specific query: find({}, {'daily_total_pages': 1}).
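To see why sys.getsizeof() misleads here: it reports only the size of the outer container, not anything it references. A minimal stdlib-only illustration (the large nested list is a stand-in for whatever bulky field the real documents carried):

```python
import sys

# A small-looking wrapper around a large payload,
# like the real documents turned out to be.
doc = {"daily_total_pages": 16929799, "payload": list(range(100000))}

shallow = sys.getsizeof(doc)             # just the dict object: a few hundred bytes
payload = sys.getsizeof(doc["payload"])  # the referenced list: hundreds of KB

print(shallow < payload)  # True: getsizeof never recursed into the payload
```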
The avgObjSize is way out of line with the 280-byte estimate: it says your objects average around 5MB, and storageSize is near 1GB. If you're memory-constrained, a query that needs to touch all 1GB of the file will cause lots of page faults.
Have you tried compacting?
db.runCommand({compact: 'Aggregates'})
or repairing?
db.repairDatabase()
If that doesn't work, try pulling back just the fields needed for the sum rather than the whole documents. It may be that those documents really are 5MB each and the time is being spent pulling data over the wire.
from datetime import datetime

def get_total():
    start = datetime.now()
    print sum(x['daily_total_pages'] for x in c.Aggregates.find({}, {'daily_total_pages': 1}))
    end = datetime.now()
    print (end - start).seconds
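Another option worth trying (a sketch, not tested against this dataset): let the server do the sum with the aggregation framework, so only one tiny result document crosses the wire instead of 92 projections. The pure-Python reduction below shows what the $group/$sum stage computes:

```python
# With a live connection, something like this would run server-side:
# pipeline = [{"$group": {"_id": None, "total": {"$sum": "$daily_total_pages"}}}]
# total = next(iter(c.Aggregates.aggregate(pipeline)))["total"]

# What $group/$sum computes, in plain Python over stand-in documents:
docs = [{"daily_total_pages": 100},
        {"daily_total_pages": 250},
        {"daily_total_pages": 7}]
total = sum(d["daily_total_pages"] for d in docs)
print(total)  # 357
```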