I have a popular social game, for which I was storing every user action as an Event entity in the data store and then once per day exporting these events from GAE for processing. Now I'd like to delete all of these entities from the data store, all 300 million of them.
I started using the "datastore admin" for this. In the first few hours I let it run, it managed to delete about 2 million entities and use $10 of CPU to do this. So it seems it will cost $1500 just to run this delete. I'm starting to think I would be better off just deleting the whole application and copy items I don't want to delete over to the new application.
Any other options?
Update
I got advice on #appengine in IRC that simply getting the keys of 2000 entities at a time and spawning tasks to delete them in pieces (can pass keys as strings to tasks) may be cheaper than using the Datastore Admin tool. I am trying this now and will try to remember to report back tomorrow if this seems to be cheaper or not.
Removing indexes first will substantially reduce the cost of deleting records. You could also limit your deletion rate to keep under the 6.5 CPU hours per day you're allocated for free, to avoid paying.
Regarding doing the deletion yourself, it's unlikely you can do it more efficiently than the datastore admin tool, which already does keys-only queries and uses cursors.
Presuming your goal is to delete for the cheapest cost, rather than face migration, I would launch a self-scheduling task, that would clean up several hundred/thousand keys in each run, then re-schedules itself. I would purposefully throttle this task on a separate queue so that the task ran only often enough to avoid hitting my daily free quotas if possible.
If you have some "acceptable" CPU cost per day that you'd be willing to live with, you could throttle the queue less, either way it would be better if each delete doesn't take too long to complete so it doesn't impact instance latency too heavily.
kevpie's comment on vacuuming may also be relevant if you have composite indexes.
I've run into this problem while developing a new app that didn't always clean up after itself. After running several experiments, what I learned was
Tune the number of deletions you do so that you do as many as possible without getting timeouts. For me, that was around 200 entities per chunk, though sometimes I could get away with 400, but never more.
If you can query in such a way that you can get the keys only, it is more efficient.
My biggest efficiency gains were from using Query Cursors. These enabled more efficient queries with bigger chunk sizes.
Use scheduling to spread the pain. This can be automatic cron scheduling or manual scheduling. Since my project was still alpha, I would check my quotas late in the evening and run out unused quota by starting my deletion tasks manually.
Complain to Google. This had the effect of making me feel better.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With