Why does BigQuery fail to parse an Avro file that is accepted by avro-tools?

Question

I'm trying to export google cloud datastore data to Avro files in google cloud storage and then load those files into BigQuery.

Firstly, I know that Big Query loads datastore backups. This has several disadvantages that I'd like to avoid:

Backup tool is closed source
Backup tool format is undocumented.
Backup tool format cannot be read directly by Dataflow
Backup scheduling for appengine is in (apparently perpetual) alpha.
It is possible to implement your own backup handler in appengine, but it is fire and forget. You won't know when exactly the backup has finished or what the file name will be.

With the motivation clarified for this experiment here is my Dataflow Pipeline to export the data to avro format:

package com.example.dataflow;

import com.google.api.services.datastore.DatastoreV1;
import com.google.api.services.datastore.DatastoreV1.Entity;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.coders.AvroCoder;
import com.google.cloud.dataflow.sdk.io.AvroIO;
import com.google.cloud.dataflow.sdk.io.DatastoreIO;
import com.google.cloud.dataflow.sdk.io.Read;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.protobuf.ProtobufData;
import org.apache.avro.protobuf.ProtobufDatumWriter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.ByteArrayOutputStream;

public class GCDSEntitiesToAvroSSCCEPipeline {

    private static final String GCS_TARGET_URI = "gs://myBucket/datastore/dummy";
    private static final String ENTITY_KIND = "Dummy";

    private static Schema getSchema() {
        return ProtobufData.get().getSchema(Entity.class);
    }

    private static final Logger LOG = LoggerFactory.getLogger(GCDSEntitiesToAvroSSCCEPipeline.class);
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);

        DatastoreV1.Query.Builder q = DatastoreV1.Query.newBuilder()
                .addKind(DatastoreV1.KindExpression.newBuilder().setName(ENTITY_KIND));

        p.apply(Read.named("DatastoreQuery").from(DatastoreIO.source()
                .withDataset(options.as(DataflowPipelineOptions.class).getProject())
                .withQuery(q.build())))
            .apply(ParDo.named("ProtoBufToAvro").of(new ProtoBufToAvro()))
            .setCoder(AvroCoder.of(getSchema()))
            .apply(AvroIO.Write.named("WriteToAvro")
                    .to(GCS_TARGET_URI)
                    .withSchema(getSchema())
                    .withSuffix(".avro"));
        p.run();

    }

    private static class ProtoBufToAvro extends DoFn<Entity, GenericRecord> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            Schema schema = getSchema();
            ProtobufDatumWriter<Entity> pbWriter = new ProtobufDatumWriter<>(Entity.class);
            DataFileWriter<Entity> dataFileWriter = new DataFileWriter<>(pbWriter);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            dataFileWriter.create(schema, bos);
            dataFileWriter.append(c.element());
            dataFileWriter.close();

            DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
            DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(
                    new SeekableByteArrayInput(bos.toByteArray()), datumReader);

            c.output(dataFileReader.next());

        }
    }
}

The pipeline runs fine, however when I try to load the resultant Avro file into big query I get the following error:

bq load --project_id=roodev001 --source_format=AVRO dummy.dummy_1 gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro
Waiting on bqjob_r5c9b81a49572a53b_00000154951eb523_1 ... (0s) Current status: DONE   
BigQuery error in load operation: Error processing job 'roodev001:bqjob_r5c9b81a49572a53b_00000154951eb523_1': The Apache Avro library failed to parse file
gs://roodev001.appspot.com/datastore/dummy-00000-of-00001.avro.

However if I load the resultant avro file with avro tool, everything is just fine:

avro-tools tojson datastore-dummy-00000-of-00001.avro | head
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"key":{"com.google.api.services.datastore.DatastoreV1$.Key":{"partition_id":{"com.google.api.services.datastore.DatastoreV1$.PartitionId":{"dataset_id":"s~roodev001","namespace":""}},"path_element":[{"kind":"Dummy","id":4503905778008064,"name":""}]}},"property":[{"name":"number","value":{"boolean_value":false,"integer_value":879,"double_value":0.0,"timestamp_microseconds_value":0,"key_value":null,"blob_key_value":"","string_value":"","blob_value":"","entity_value":null,"list_value":[],"meaning":0,"indexed":true}}]}
...

I used this code to populate the datastore with dummy data before running the Dataflow pipeline:

package com.example.datastore;

import com.google.gcloud.AuthCredentials;
import com.google.gcloud.datastore.*;

import java.io.IOException;

public static void main(String[] args) throws IOException {

    Datastore datastore = DatastoreOptions.builder()
            .projectId("myProjectId")
            .authCredentials(AuthCredentials.createApplicationDefaults())
            .build().service();

    KeyFactory dummyKeyFactory = datastore.newKeyFactory().kind("Dummy");


    Batch batch = datastore.newBatch();
    int batchCount = 0;
    for (int i = 0; i < 4000; i++){
        IncompleteKey key = dummyKeyFactory.newKey();
        System.out.println("adding entity " + i);
        batch.add(Entity.builder(key).set("number", i).build());
        batchCount++;
        if (batchCount > 99) {
            batch.submit();
            batch = datastore.newBatch();
            batchCount = 0;
        }
    }

    System.out.println("done");

}

So why is BigQuery rejecting my avro files?

Hua Zhang · Accepted Answer

BigQuery uses the C++ Avro library, and apparently it doesn't like the "$" in the namespace. Here's the error message:

Invalid namespace: com.google.api.services.datastore.DatastoreV1$

We're working on getting these Avro error messages out to the end user.

Why does BigQuery fail to parse an Avro file that is accepted by avro-tools?

Tags:

google-app-engine

google-cloud-datastore

avro

google-bigquery

google-cloud-dataflow

Frank Wilson

Video Answer

1 Answers

Hua Zhang

Recent Activity

Donate For Us

Why does BigQuery fail to parse an Avro file that is accepted by avro-tools?

Tags:

google-app-engine

google-cloud-datastore

avro

google-bigquery

google-cloud-dataflow

Frank Wilson

Video Answer

1 Answers

Hua Zhang

Related questions

Recent Activity

Donate For Us