I am creating a demo pipeline to load a CSV file into BigQuery with Dataflow, using my free Google account. This is what I am facing.
When I read from a GCS file and just log the data, everything works perfectly. Below is my sample code.
This code runs okay:
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setProject("project12345");
options.setStagingLocation("gs://mybucket/staging");
options.setRunner(DataflowRunner.class);
DataflowRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from("gs://mybucket/charges.csv")).apply(ParDo.of(new DoFn<String, Void>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        LOG.info(c.element());
    }
}));
However, when I add a temp folder location with a path to a bucket I created, I get an error. Below is my code:
LOG.debug("Starting Pipeline");
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setProject("project12345");
options.setStagingLocation("gs://mybucket/staging");
options.setTempLocation("gs://project12345/temp");
options.setJobName("csvtobq");
options.setRunner(DataflowRunner.class);
DataflowRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
boolean isStreaming = false;
TableReference tableRef = new TableReference();
tableRef.setProjectId("project12345");
tableRef.setDatasetId("charges_data");
tableRef.setTableId("charges_data_id");
p.apply("Loading Data from GCS", TextIO.read().from("gs://mybucket/charges.csv"))
.apply("Convert to BiqQuery Table Row", ParDo.of(new FormatForBigquery()))
.apply("Write into Data in to Big Query",
BigQueryIO.writeTableRows().to(tableRef).withSchema(FormatForBigquery.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(isStreaming ? BigQueryIO.Write.WriteDisposition.WRITE_APPEND
: BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
p.run().waitUntilFinish();
}
When I run this, I get the following error:
Exception in thread "main" java.lang.IllegalArgumentException: DataflowRunner requires gcpTempLocation, but failed to retrieve a value from PipelineOptions
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:242)
at demobigquery.StarterPipeline.main(StarterPipeline.java:74)
Caused by: java.lang.IllegalArgumentException: Error constructing default value for gcpTempLocation: tempLocation is not a valid GCS path, gs://project12345/temp.
at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create(GcpOptions.java:247)
at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create(GcpOptions.java:228)
at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper(ProxyInvocationHandler.java:592)
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault(ProxyInvocationHandler.java:533)
at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:155)
at com.sun.proxy.$Proxy15.getGcpTempLocation(Unknown Source)
at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:240)
Is this an authentication issue? I am using JSON credentials as the project owner from GCP via the Eclipse Dataflow plugin.
Any help would be highly appreciated.
It looks like the error message is thrown from [1]. The default GCS path validator is implemented in [2]. As you can see, the Beam code also attaches the cause exception to the IllegalArgumentException, so you can look further down the stack trace for the exception that happened in GcsPathValidator.
[1] https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L278
[2] https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsPathValidator.java#L29
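To see that underlying cause directly, here is a minimal sketch (not from the original answer; it assumes the same SLF4J-style LOG used in the question) that wraps pipeline construction and prints the whole cause chain:
try {
    Pipeline p = Pipeline.create(options);
    // ... build and run the pipeline as in the question ...
} catch (IllegalArgumentException e) {
    // Walk the cause chain so the GcsPathValidator failure is visible.
    for (Throwable t = e; t != null; t = t.getCause()) {
        LOG.error("{}: {}", t.getClass().getName(), t.getMessage());
    }
}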
There could be multiple reasons for this:
You are not logged in with the right GCP project credentials - either the wrong user (or no user at all) is logged in, or the wrong project is being used.
Ensure that the GOOGLE_APPLICATION_CREDENTIALS environment variable points to credentials for the right user and project. If not, obtain the right credentials using
gcloud auth application-default login
Download the JSON key, point GOOGLE_APPLICATION_CREDENTIALS at the downloaded file, restart your system, and try again (see the sketch after this list for a quick way to check which credentials are picked up).
You could be logging into the right project with the right user ID, but the requisite permissions for bucket access might be absent. Ensure that your account has read and write access to the bucket.
The GCS path you are trying to use does not exist or is misspelled.
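As a quick check for the first point, here is a minimal sketch (an assumption on my part, not part of the original answer; the CheckAdc class name is made up, and it relies on google-auth-library, which the Beam GCP dependencies already pull in) that prints which Application Default Credentials the runner will pick up:

import java.io.IOException;
import com.google.auth.oauth2.GoogleCredentials;
import com.google.auth.oauth2.ServiceAccountCredentials;

public class CheckAdc {
    public static void main(String[] args) throws IOException {
        // Resolve the Application Default Credentials the same way the
        // Dataflow runner will (GOOGLE_APPLICATION_CREDENTIALS, gcloud ADC, ...).
        GoogleCredentials creds = GoogleCredentials.getApplicationDefault();
        if (creds instanceof ServiceAccountCredentials) {
            System.out.println("Service account in use: "
                + ((ServiceAccountCredentials) creds).getClientEmail());
        } else {
            System.out.println("Credentials type: " + creds.getClass().getSimpleName());
        }
    }
}

If this prints the wrong account, or throws because no credentials can be found, fix GOOGLE_APPLICATION_CREDENTIALS or re-run gcloud auth application-default login before starting the pipeline.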