We've got a reasonably-sized database on Google App Engine - just over 50,000 entities - that we want to clear out stale data from. The plan was to write a deferred task to iterate over the entities we no longer wanted, and delete them in batches.
One complication is that our entities have child entities that we want to purge as well -- no problem, we thought; we'd just query the datastore for those entities, and drop them at the same time as the parent:
from google.appengine.ext import db

query = ParentKind.all()
query.filter('bar =', 'foo')

to_delete = []
for entity in query.fetch(100):  # one batch of 100 parents
    to_delete.append(entity)
    # Pull in each parent's children so they're purged in the same call.
    to_delete.extend(ChildKindA.all().ancestor(entity).fetch(100))
    to_delete.extend(ChildKindB.all().ancestor(entity).fetch(100))

db.delete(to_delete)
We limited ourselves to deleting 100 ParentKind entities at a time; each ParentKind had around 40 ChildKindA and ChildKindB children in total, so one batch touched perhaps 4,000 entities in all (100 parents × ~40 children each, plus the parents themselves).
This seemed reasonable at the time, but we ran one batch as a test, and the resulting query took 9 seconds to run -- and spent 1933 seconds in billable CPU time accessing the datastore.
This seems pretty harsh -- 0.5 billable seconds per entity! -- but we're not entirely sure what we're doing wrong. Is it simply the size of the batch? Are ancestor queries particularly slow? Or, are deletes (and indeed, all datastore accesses) simply slow as molasses?
We changed our queries to be keys_only, and while that reduced the time to run one batch to 4.5 real seconds, it still cost ~1900 seconds in CPU time.
Next, we installed Appstats in our app (thanks, kevpie) and ran a smaller batch -- 10 parent entities, which would amount to ~450 entities in total. Here's the updated code:
query = ParentKind.all(keys_only=True)
query.filter('bar =', 'foo')

to_delete = []
for key in query.fetch(10):  # a smaller batch of 10 parents
    to_delete.append(key)
    # Keys-only ancestor queries for the children, too.
    to_delete.extend(ChildKindA.all(keys_only=True).ancestor(key).fetch(100))
    to_delete.extend(ChildKindB.all(keys_only=True).ancestor(key).fetch(100))

db.delete(to_delete)
The results from Appstats:
service.call            #RPCs   real time   api time
datastore_v3.RunQuery      22       352ms      555ms
datastore_v3.Delete         1       366ms   132825ms
taskqueue.BulkAdd           1         7ms        0ms
The Delete call is the single most expensive part of the operation!

Is there a way around this? Nick Johnson mentioned that using the bulk delete handler is currently the fastest way to delete, but ideally we don't want to delete all entities of a kind -- just the ones matching our initial bar = foo query, and their children.
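(If the single large Delete RPC is the bottleneck, one experiment is to split it into several smaller asynchronous deletes so the RPCs overlap rather than run as one monolithic call. A minimal sketch, assuming the SDK's db.delete_async is available; the batch size of 500 is illustrative, not something from the measurements above:

from google.appengine.ext import db

BATCH = 500  # illustrative; the datastore also caps how many keys one call may take

def delete_in_batches(keys):
    # Start every Delete RPC without waiting on it...
    rpcs = [db.delete_async(keys[i:i + BATCH])
            for i in range(0, len(keys), BATCH)]
    # ...then block until all of them have completed.
    for rpc in rpcs:
        rpc.get_result()

Note this only overlaps wall-clock time; the billed API time per entity would be unchanged.)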
We recently added a bulk-delete handler, documented here. It takes the most efficient possible approach to bulk deletion, though it still consumes CPU quota.
If you want to spread out the CPU burn, you could create a mapreduce job. It will still iterate over every entity (this is a current limitation of the mapper API), but you can check whether each entity meets the condition, and delete it or not, at that point.
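A minimal sketch of such a mapper, assuming the appengine-mapreduce library is on the app's path and the job is configured (e.g. in mapreduce.yaml) to feed it ParentKind entities; the handler name is hypothetical, and ChildKindA/ChildKindB are the models from the question:

from google.appengine.ext import db
from mapreduce import operation as op

def delete_stale(entity):
    # Invoked once per ParentKind entity by the mapper framework.
    if entity.bar == 'foo':
        # The children can go in the same pass, via keys-only
        # ancestor queries as in the original batch code.
        db.delete(ChildKindA.all(keys_only=True).ancestor(entity).fetch(100))
        db.delete(ChildKindB.all(keys_only=True).ancestor(entity).fetch(100))
        yield op.db.Delete(entity)

Yielding op.db.Delete lets the framework batch the parent deletes for you; entities that don't match simply pass through untouched.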
To slow down the CPU usage, assign the mapper to a task queue that you've configured to run more slowly than normal. You can then spread the run time out over several days without eating up all your CPU quota.
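For example, a deliberately slow queue might be declared in queue.yaml like this (the name and rate here are illustrative):

queue:
- name: slow-cleanup
  rate: 1/s        # start at most one task per second
  bucket_size: 1   # no bursts

The mapper job is then pointed at that queue when it's started (e.g. via the mapper API's queue_name option, if your version exposes it).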