Recommended strategies for backing up appengine datastore

Right now I use remote_api and appcfg.py download_data to take a snapshot of my database every night. It takes a long time (6 hours) and is expensive. Without rolling my own change-based backup (I'd be too scared to do something like that), what's the best option for making sure my data is safe from failure?

PS: I recognize that Google's data is probably way safer than mine. But what if one day I accidentally write a program that deletes it all?

asked Nov 09 '11 by Riley Lark


1 Answer

I think you've pretty much identified all of your choices.

  1. Trust Google not to lose your data, and hope you don't accidentally instruct them to destroy it.
  2. Perform full backups with download_data, perhaps less frequently than once per night if it is prohibitively expensive (see the example invocation after this list).
  3. Roll your own incremental backup solution.
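
For reference on option 2, a typical download_data invocation looks something like this; the app ID and output filename are placeholders, and you should check appcfg.py help download_data for the exact flags your SDK version supports:

appcfg.py download_data --application=your-app-id \
  --url=http://your-app-id.appspot.com/_ah/remote_api \
  --filename=backup.sql3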

Option 3 is actually an interesting idea. You'd need a modification timestamp on all entities, and you wouldn't catch deleted entities, but otherwise it's very doable with remote_api and cursors.
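
For concreteness, here's what that timestamp might look like on a model. The model and its other field are hypothetical, but the property name updated_at is what the downloader below expects:

from google.appengine.ext import db

class Record(db.Model):
  payload = db.TextProperty()
  # auto_now=True refreshes the timestamp on every put(), which is
  # exactly the signal an incremental backup needs to order on.
  updated_at = db.DateTimeProperty(auto_now=True)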

Edit:

Here's a simple incremental downloader for use with remote_api. Again, the caveats are that it won't notice deleted entities, and it assumes all entities store the last modification time in a property named updated_at. Use it at your own peril.

import os
import hashlib
import gzip
from google.appengine.api import app_identity
from google.appengine.ext.db.metadata import Kind
from google.appengine.api.datastore import Query
from google.appengine.datastore.datastore_query import Cursor

INDEX = 'updated_at'  # property every entity is assumed to carry (see above)
BATCH = 50            # entities fetched per round trip
DEPTH = 3             # levels of hash-prefix directories, to keep them small

path = ['backups', app_identity.get_application_id()]
for kind in Kind.all():
  kind = kind.kind_name
  if kind.startswith('__'):
    continue  # skip built-in metadata kinds
  while True:
    print 'Fetching %d %s entities' % (BATCH, kind)
    # Resume from the cursor saved by the previous run, if there is one.
    path.extend([kind, 'cursor.txt'])
    try:
      cursor = open(os.path.join(*path)).read()
      cursor = Cursor.from_websafe_string(cursor)
    except IOError:
      cursor = None  # first run for this kind
    path.pop()
    query = Query(kind, cursor=cursor)
    query.Order(INDEX)
    entities = query.Get(BATCH)
    for entity in entities:
      # Shard the files across DEPTH levels of directories keyed on a
      # hash of the entity key, so no single directory grows huge.
      digest = hashlib.sha1(str(entity.key())).hexdigest()
      for i in range(DEPTH):
        path.append(digest[i])
      try:
        os.makedirs(os.path.join(*path))
      except OSError:
        pass  # directory already exists
      path.append('%s.xml.gz' % entity.key())
      print 'Writing', os.path.join(*path)
      f = gzip.open(os.path.join(*path), 'wb')
      f.write(entity.ToXml())
      f.close()
      path = path[:-1 - DEPTH]  # back up to the kind directory
    if entities:
      # Persist the cursor so the next run picks up where this one stopped.
      path.append('cursor.txt')
      f = open(os.path.join(*path), 'w')
      f.write(query.GetCursor().to_websafe_string())
      f.close()
      path.pop()
    path.pop()
    if len(entities) < BATCH:
      break  # a short batch means we've caught up on this kind
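
One note on running it: the script assumes a remote_api stub has already been configured in the process. A minimal setup sketch, assuming the standard /_ah/remote_api handler is enabled in your app (the hostname is a placeholder):

import getpass
from google.appengine.ext.remote_api import remote_api_stub

def auth_func():
  # remote_api prompts for the credentials of an app administrator.
  return (raw_input('Email: '), getpass.getpass('Password: '))

# Passing None for the app ID lets the stub fetch it from the server;
# replace the hostname with your own app's domain.
remote_api_stub.ConfigureRemoteApi(None, '/_ah/remote_api', auth_func,
                                   'your-app-id.appspot.com')

After that, the downloader can be run as an ordinary local Python script, and each subsequent run should only fetch entities whose updated_at has moved past the saved cursor.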
answered by Drew Sears