Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Apache Beam: Cannot find DataflowRunner

Tags:

I am trying to run a pipeline, which I was able to run successfully with DirectRunner, on Google Cloud Dataflow. When I execute this Maven command:

mvn compile exec:java \
    -Dexec.mainClass=com.example.Pipeline \
    -Dexec.args="--project=project-name \
    --stagingLocation=gs://bucket-name/staging/ \
    ... custom arguments ...
    --runner=DataflowRunner"

I get the following error:

No Runner was specified and the DirectRunner was not found on the classpath.
[ERROR] Specify a runner by either:
[ERROR]     Explicitly specifying a runner by providing the 'runner' property
[ERROR]     Adding the DirectRunner to the classpath
[ERROR]     Calling 'PipelineOptions.setRunner(PipelineRunner)' directly

I intentionally removed DirectRunner from my pom.xml and added this:

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
    <version>2.0.0</version>
    <scope>runtime</scope>
</dependency>

I went ahead and removed the <scope> tag, then called options.setRunner(DataflowRunner.class), but it didn't help. Extending my own PipelineOptions interface from DataflowPipelineOptions did not solve the problem as well.

Looks like it ignores runner option in a way I am not able to debug.

Update: Here is the full pom.xml, in case that helps:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>dataflow</artifactId>
    <version>0.1</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-core</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
            <version>2.0.0</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-jdbc</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.beam</groupId>
            <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
            <version>2.0.0</version>
        </dependency>
        <dependency>
            <groupId>org.postgresql</groupId>
            <artifactId>postgresql</artifactId>
            <version>42.1.4.jre7</version>
        </dependency>
    </dependencies>
</project>
like image 726
Utku Avatar asked Aug 05 '17 22:08

Utku


People also ask

How do I run Apache Beam on dataflow?

Cloud Dataflow Runner prerequisites and setupSelect or create a Google Cloud Platform Console project. Enable billing for your project. Enable the required Google Cloud APIs: Cloud Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, and Cloud Resource Manager.

Is dataflow same as Apache Beam?

The Apache Beam model provides useful abstractions that insulate you from low-level details of distributed processing, such as coordinating individual workers, sharding datasets, and other such tasks. Dataflow fully manages these low-level details.


1 Answers

Forgetting to pass my PipelineOptions instance as a parameter to Pipeline.create() method was the cause of my problem.

PipelineOptionsFactory.register(MyOptions.class);
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline pipeline = Pipeline.create(options); // Don't forget the options argument.
...
pipeline.run();
like image 102
Utku Avatar answered Sep 28 '22 00:09

Utku