Have a GAE datastore kind with several 100'000s of objects in them. Want to do several involved queries (involving counting queries). Big Query seems a god fit for doing this. Is there currently an easy way to query a live AppEngine Datastore using Big Query?

With the new (from September 2013) streaming inserts api you can import records from your app into BigQuery. The data is available in BigQuery immediately so this should satisfy your live requirement. Whilst this question is now a bit old, this may be an easier solution for anyone stumbling across this question At the moment though getting this to work from a the local dev server is patchy at best.

Google App Engine: Using Big Query on datastore?

2 Answers

You can't run a BigQuery directly on DataStore entities, but you can write a Mapper Pipeline that reads entities out of DataStore, writes them to CSV in Google Cloud Storage, and then ingests those into BigQuery - you can even automate the process. Here's an example of using the Mapper API classes for just the DataStore to CSV step:

import re
import time
from datetime import datetime
import urllib
import httplib2
import pickle

from google.appengine.ext import blobstore
from google.appengine.ext import db
from google.appengine.ext import webapp

from google.appengine.ext.webapp.util import run_wsgi_app
from google.appengine.ext.webapp import blobstore_handlers
from google.appengine.ext.webapp import util
from google.appengine.ext.webapp import template

from mapreduce.lib import files
from google.appengine.api import taskqueue
from google.appengine.api import users

from mapreduce import base_handler
from mapreduce import mapreduce_pipeline
from mapreduce import operation as op

from apiclient.discovery import build
from google.appengine.api import memcache
from oauth2client.appengine import AppAssertionCredentials


#Number of shards to use in the Mapper pipeline
SHARDS = 20

# Name of the project's Google Cloud Storage Bucket
GS_BUCKET = 'your bucket'

# DataStore Model
class YourEntity(db.Expando):
  field1 = db.StringProperty() # etc, etc

ENTITY_KIND = 'main.YourEntity'


class MapReduceStart(webapp.RequestHandler):
  """Handler that provides link for user to start MapReduce pipeline.
  """
  def get(self):
    pipeline = IteratorPipeline(ENTITY_KIND)
    pipeline.start()
    path = pipeline.base_path + "/status?root=" + pipeline.pipeline_id
    logging.info('Redirecting to: %s' % path)
    self.redirect(path)


class IteratorPipeline(base_handler.PipelineBase):
  """ A pipeline that iterates through datastore
  """
  def run(self, entity_type):
    output = yield mapreduce_pipeline.MapperPipeline(
      "DataStore_to_Google_Storage_Pipeline",
      "main.datastore_map",
      "mapreduce.input_readers.DatastoreInputReader",
      output_writer_spec="mapreduce.output_writers.FileOutputWriter",
      params={
          "input_reader":{
              "entity_kind": entity_type,
              },
          "output_writer":{
              "filesystem": "gs",
              "gs_bucket_name": GS_BUCKET,
              "output_sharding":"none",
              }
          },
          shards=SHARDS)


def datastore_map(entity_type):
  props = GetPropsFor(entity_type)
  data = db.to_dict(entity_type)
  result = ','.join(['"%s"' % str(data.get(k)) for k in props])
  yield('%s\n' % result)


def GetPropsFor(entity_or_kind):
  if (isinstance(entity_or_kind, basestring)):
    kind = entity_or_kind
  else:
    kind = entity_or_kind.kind()
  cls = globals().get(kind)
  return cls.properties()


application = webapp.WSGIApplication(
                                     [('/start', MapReduceStart)],
                                     debug=True)

def main():
  run_wsgi_app(application)

if __name__ == "__main__":
  main()

If you append this to the end of your IteratorPipeline class: yield CloudStorageToBigQuery(output), you can pipe the resulting csv filehandle into a BigQuery ingestion pipe... like this:

class CloudStorageToBigQuery(base_handler.PipelineBase):
  """A Pipeline that kicks off a BigQuery ingestion job.
  """
  def run(self, output):

# BigQuery API Settings
SCOPE = 'https://www.googleapis.com/auth/bigquery'
PROJECT_ID = 'Some_ProjectXXXX'
DATASET_ID = 'Some_DATASET'

# Create a new API service for interacting with BigQuery
credentials = AppAssertionCredentials(scope=SCOPE)
http = credentials.authorize(httplib2.Http())
bigquery_service = build("bigquery", "v2", http=http)

jobs = bigquery_service.jobs()
table_name = 'datastore_dump_%s' % datetime.utcnow().strftime(
    '%m%d%Y_%H%M%S')
files = [str(f.replace('/gs/', 'gs://')) for f in output]
result = jobs.insert(projectId=PROJECT_ID,
                    body=build_job_data(table_name,files)).execute()
logging.info(result)

def build_job_data(table_name, files):
  return {"projectId": PROJECT_ID,
          "configuration":{
              "load": {
                  "sourceUris": files,
                  "schema":{
                      # put your schema here
                      "fields": fields
                      },
                  "destinationTable":{
                      "projectId": PROJECT_ID,
                      "datasetId": DATASET_ID,
                      "tableId": table_name,
                      },
                  }
              }
          }

answered Oct 04 '22 06:10

Michael Manoochehri

With the new (from September 2013) streaming inserts api you can import records from your app into BigQuery.

The data is available in BigQuery immediately so this should satisfy your live requirement.

Whilst this question is now a bit old, this may be an easier solution for anyone stumbling across this question

At the moment though getting this to work from a the local dev server is patchy at best.

answered Oct 04 '22 06:10

danmux

Related questions
                            
                                Google App Engine - How do I split code into multiple files? (webapp)
                            
                                Read delay in App Engine Datastore after put()
                            
                                GAE webapp2 session: the correct process of creating and checking sessions
                            
                                using google app engine SDK in pycharm
                            
                                Unable to start App Engine application after updating it via Google Cloud SDK
                            
                                Best way to profile/optimize a website on Google App Engine
                            
                                How to populate my WTForm variables?
                            
                                Is there a way to use UTF-8 with app engine?
                            
                                How does Go handle concurrent request on Google App Engine
                            
                                Setting f1-micro resource limits in app.yaml for google cloud compute node.js app without vm_settings
                            
                                Using Google App Engine SDK with Python 2.7 on Mac OS X 10.6
                            
                                How to get offline token and refresh token and auto-refresh access to Google API
                            
                                App Engine: The private key you've selected does not appear to be valid
                            
                                ImportError: No module named flask_restful
                            
                                Best opensource IDE for building applications on Google App Engine? [closed]
                            
                                DevServer fails after updating to java 6u31
                            
                                Google App Engine Versioning in the Datastore
                            
                                Python 2.7 : How to use BeautifulSoup in Google App Engine? [duplicate]
                            
                                db.ReferenceProperty() vs ndb.KeyProperty in App Engine
                            
                                com.google.android.gms.auth.GoogleAuthException: getToken(Unknown Source) exception

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Google App Engine: Using Big Query on datastore?

Tags:

google-app-engine

google-bigquery

Allan D.

People also ask

2 Answers

Michael Manoochehri

danmux

Recent Activity

Donate For Us