Do a mass db.delete on App Engine, without eating CPU

We've got a reasonably sized database on Google App Engine - just over 50,000 entities - that we want to clear stale data out of. The plan was to write a deferred task that iterates over the entities we no longer want and deletes them in batches, roughly along the lines sketched below.
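
Something like this, in outline (purge_stale and BATCH_SIZE are just illustrative names, not our production code):

from google.appengine.ext import db, deferred

BATCH_SIZE = 100

def purge_stale():
    # Grab one batch of stale parents and delete them.
    batch = ParentKind.all().filter('bar =', 'foo').fetch(BATCH_SIZE)
    if batch:
        db.delete(batch)
        # Re-queue ourselves until nothing matching is left.
        deferred.defer(purge_stale)

# Kick things off (the deferred library needs its builtin enabled in app.yaml).
deferred.defer(purge_stale)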

One complication is that our entities have child entities that we also want to purge -- no problem, we thought; we'd just query the datastore for those as well, and drop them at the same time as their parents:

from google.appengine.ext import db

query = ParentKind.all()
query.filter('bar =', 'foo')

to_delete = []
for entity in query.fetch(100):  # 100 parents per batch
    to_delete.append(entity)
    to_delete.extend(ChildKindA.all().ancestor(entity).fetch(100))
    to_delete.extend(ChildKindB.all().ancestor(entity).fetch(100))

db.delete(to_delete)

We limited ourselves to deleting 100 ParentKind entities at a time; each ParentKind had around 40 ChildKindA and ChildKindB children in total, so a batch touched perhaps 4,000 entities.

This seemed reasonable at the time, but we ran one batch as a test, and the resulting query took 9 seconds to run -- and spent 1933 seconds in billable CPU time accessing the datastore.

This seems pretty harsh -- 0.5 billable seconds per entity! -- but we're not entirely sure what we're doing wrong. Is it simply the size of the batch? Are ancestor queries particularly slow? Or, are deletes (and indeed, all datastore accesses) simply slow as molasses?

Update

We changed our queries to be keys_only, and while that reduced the time to run one batch to 4.5 real seconds, it still cost ~1900 seconds in CPU time.

Next, we installed Appstats in our app (thanks, kevpie) and ran a smaller batch -- 10 parent entities, which amounts to ~450 entities in total. Here's the updated code:

query = ParentKind.all(keys_only=True)
query.filter('bar =', 'foo')

to_delete = []
for key in query.fetch(10):  # 10 parents per batch
    to_delete.append(key)
    # ancestor() accepts a key, so keys-only parents are fine here.
    to_delete.extend(ChildKindA.all(keys_only=True).ancestor(key).fetch(100))
    to_delete.extend(ChildKindB.all(keys_only=True).ancestor(key).fetch(100))

db.delete(to_delete)

The results from Appstats:

service.call           #RPCs  real time  api time
datastore_v3.RunQuery  22     352ms      555ms
datastore_v3.Delete    1      366ms      132825ms
taskqueue.BulkAdd      1      7ms        0ms

The Delete call is the single most expensive part of the operation!

Is there a way around this? Nick Johnson mentioned that using the bulk delete handler is the fastest way to delete at present, but ideally we don't want to delete all entities of a kind -- just the ones that match our initial bar = foo query, and their children.
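
For the record, what we'd ideally like is something along these lines: a cursor-driven deferred loop that only touches the matching parents and their children. This is just a sketch, not code we've benchmarked, and purge_matching is an illustrative name:

from google.appengine.ext import db, deferred

def purge_matching(cursor=None):
    query = ParentKind.all(keys_only=True).filter('bar =', 'foo')
    if cursor:
        query.with_cursor(cursor)
    parent_keys = query.fetch(10)
    if not parent_keys:
        return  # nothing left to delete
    to_delete = list(parent_keys)
    for key in parent_keys:
        to_delete.extend(ChildKindA.all(keys_only=True).ancestor(key).fetch(100))
        to_delete.extend(ChildKindB.all(keys_only=True).ancestor(key).fetch(100))
    db.delete(to_delete)
    # Pick up where this batch left off.
    deferred.defer(purge_matching, query.cursor())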

asked Dec 15 '10 by Blair Holloway

2 Answers

We recently added a bulk-delete handler, documented here. It takes the most efficient possible approach to bulk deletion, though it still consumes CPU quota.

answered by Nick Johnson

If you want to spread out the CPU burn, you could create a mapreduce job. It will still iterate over every entity (this is a current limitation of the mapper API), but you can check whether each entity meets your condition and delete it, or not, at that point.
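
A mapper along these lines would do it (a rough sketch assuming the appengine-mapreduce Python library; delete_stale and the bar == 'foo' check are just illustrative). You'd point a mapreduce.yaml entry at this function with ParentKind as the entity kind, and the framework will walk the whole kind for you:

from google.appengine.ext import db
from mapreduce import operation as op

def delete_stale(entity):
    # Called once per ParentKind entity by the mapreduce framework.
    if entity.bar == 'foo':
        # The mapper only visits ParentKind, so clean up the children directly
        # (or run separate mappers over ChildKindA and ChildKindB).
        db.delete(ChildKindA.all(keys_only=True).ancestor(entity).fetch(100))
        db.delete(ChildKindB.all(keys_only=True).ancestor(entity).fetch(100))
        yield op.db.Delete(entity)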

To slow down the CPU usage, assign the mapper to a task queue that you've configured to run slower than normal. You can spread the run time out over several days and not eat up all your CPU quota.

answered by Luke Francl