Looks like google has released support for querying the datastore from dataflow/beam in python. I'm trying to get it to run locally but I'm running into some issues:
import apache_beam as beam
from apache_beam.io.datastore.v1.datastoreio import ReadFromDatastore
from gcloud import datastore
client = datastore.Client('my-project')
query = client.query(kind='Document')
options = get_options()
p = beam.Pipeline(options=options)
entities = p | 'read' >> ReadFromDatastore(project='my-project', query=query)
entities | 'write' >> beam.io.Write(beam.io.TextFileSink('gs://output.txt'))
p.run()
This is giving me a
AttributeError: 'Query' object has no attribute 'HasField' [while running 'read/Split Query']
I'm guessing that I'm passing in the wrong query object (there are 3-4 pip packages that you can import datastore from) but I can't figure out which one I'm supposed to pass in. In the tests they are passing in protobuf. Is that what I have to use? Can anyone show a simple example query using protobuf if that's what I have to do?
The wordcount example uses protobufs for the query.
Looks like you need something like:
from google.datastore.v1 import query_pb2
...
query = query_pb2.Query()
query.kind.add().name = 'Document'
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With