I am trying to run a pipeline, which I was able to run successfully with DirectRunner
, on Google Cloud Dataflow. When I execute this Maven command:
mvn compile exec:java \
-Dexec.mainClass=com.example.Pipeline \
-Dexec.args="--project=project-name \
--stagingLocation=gs://bucket-name/staging/ \
... custom arguments ...
--runner=DataflowRunner"
I get the following error:
No Runner was specified and the DirectRunner was not found on the classpath.
[ERROR] Specify a runner by either:
[ERROR] Explicitly specifying a runner by providing the 'runner' property
[ERROR] Adding the DirectRunner to the classpath
[ERROR] Calling 'PipelineOptions.setRunner(PipelineRunner)' directly
I intentionally removed DirectRunner
from my pom.xml
and added this:
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
<version>2.0.0</version>
<scope>runtime</scope>
</dependency>
I went ahead and removed the <scope>
tag, then called options.setRunner(DataflowRunner.class)
, but it didn't help. Extending my own PipelineOptions
interface from DataflowPipelineOptions
did not solve the problem as well.
Looks like it ignores runner
option in a way I am not able to debug.
Update: Here is the full pom.xml
, in case that helps:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>dataflow</artifactId>
<version>0.1</version>
<dependencies>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-core</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
<version>2.0.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-jdbc</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>42.1.4.jre7</version>
</dependency>
</dependencies>
</project>
Cloud Dataflow Runner prerequisites and setupSelect or create a Google Cloud Platform Console project. Enable billing for your project. Enable the required Google Cloud APIs: Cloud Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, and Cloud Resource Manager.
The Apache Beam model provides useful abstractions that insulate you from low-level details of distributed processing, such as coordinating individual workers, sharding datasets, and other such tasks. Dataflow fully manages these low-level details.
Forgetting to pass my PipelineOptions instance as a parameter to Pipeline.create()
method was the cause of my problem.
PipelineOptionsFactory.register(MyOptions.class);
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline pipeline = Pipeline.create(options); // Don't forget the options argument.
...
pipeline.run();
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With