I'm using the blobstore to backup and recovery entities in csv format. The process is working well for all of my smaller models. However, once I start to work on models with more than 2K entities, I am exceeded the soft memory limit. I'm only fetching 50 entities at a time and then writing the results out to the blobstore, so I'm not clear why my memory usage would be building up. I can reliably make the method fail just by increasing the "limit" value passed in below which results in the method running just a little longer to export a few more entities.
Any recommendations on how to optimize this process to reduce memory consumption?
Also, the files produced will only <500KB in size. Why would the process use 140 MB of memory?
Simplified example:
file_name = files.blobstore.create(mime_type='application/octet-stream')
with files.open(file_name, 'a') as f:
writer = csv.DictWriter(f, fieldnames=properties)
for entity in models.Player.all():
row = backup.get_dict_for_entity(entity)
writer.writerow(row)
Produces the error: Exceeded soft private memory limit with 150.957 MB after servicing 7 requests total
Simplified example 2:
The problem seems to be with using files and the with statement in python 2.5. Factoring out the csv stuff, I can reproduce almost the same error by simply trying to write a 4000 line text file to the blobstore.
from __future__ import with_statement
from google.appengine.api import files
from google.appengine.ext.blobstore import blobstore
file_name = files.blobstore.create(mime_type='application/octet-stream')
myBuffer = StringIO.StringIO()
#Put 4000 lines of text in myBuffer
with files.open(file_name, 'a') as f:
for line in myBuffer.getvalue().splitlies():
f.write(line)
files.finalize(file_name)
blob_key = files.blobstore.get_blob_key(file_name)
Produces the error: Exceeded soft private memory limit with 154.977 MB after servicing 24 requests total
def backup_model_to_blobstore(model, limit=None, batch_size=None):
file_name = files.blobstore.create(mime_type='application/octet-stream')
# Open the file and write to it
with files.open(file_name, 'a') as f:
#Get the fieldnames for the csv file.
query = model.all().fetch(1)
entity = query[0]
properties = entity.__class__.properties()
#Add ID as a property
properties['ID'] = entity.key().id()
#For debugging rather than try and catch
if True:
writer = csv.DictWriter(f, fieldnames=properties)
#Write out a header row
headers = dict( (n,n) for n in properties )
writer.writerow(headers)
numBatches = int(limit/batch_size)
if numBatches == 0:
numBatches = 1
for x in range(numBatches):
logging.info("************** querying with offset %s and limit %s", x*batch_size, batch_size)
query = model.all().fetch(limit=batch_size, offset=x*batch_size)
for entity in query:
#This just returns a small dictionary with the key-value pairs
row = get_dict_for_entity(entity)
#write out a row for each entity.
writer.writerow(row)
# Finalize the file. Do this before attempting to read it.
files.finalize(file_name)
blob_key = files.blobstore.get_blob_key(file_name)
return blob_key
The error looks like this in the logs
......
2012-02-02 21:59:19.063
************** querying with offset 2050 and limit 50
I 2012-02-02 21:59:20.076
************** querying with offset 2100 and limit 50
I 2012-02-02 21:59:20.781
************** querying with offset 2150 and limit 50
I 2012-02-02 21:59:21.508
Exception for: Chris (202.161.57.167)
err:
Traceback (most recent call last):
.....
blob_key = backup_model_to_blobstore(model, limit=limit, batch_size=batch_size)
File "/base/data/home/apps/singpath/163.356548765202135434/singpath/backup.py", line 125, in backup_model_to_blobstore
writer.writerow(row)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 281, in __exit__
self.close()
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 275, in close
self._make_rpc_call_with_retry('Close', request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 388, in _make_rpc_call_with_retry
_make_call(method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 236, in _make_call
_raise_app_error(e)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 179, in _raise_app_error
raise FileNotOpenedError()
FileNotOpenedError
C 2012-02-02 21:59:23.009
Exceeded soft private memory limit with 149.426 MB after servicing 14 requests total
You'd be better off not doing the batching yourself, but just iterating over the query. The iterator will pick a batch size (probably 20) that should be adequate:
q = model.all()
for entity in q:
row = get_dict_for_entity(entity)
writer.writerow(row)
This avoids re-running the query with ever-increasing offset, which is slow and causes quadratic behavior in the datastore.
An oft-overlooked fact about memory usage is that the in-memory representation of an entity can use 30-50 times the RAM compared to the serialized form of the entity; e.g. an entity that is 3KB on disk might use 100KB in RAM. (The exact blow-up factor depends on many factors; it's worse if you have lots of properties with long names and small values, even worse for repeated properties with long names.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With