
Gradle Support for GCP Dataflow Templates?

According to Google's Dataflow documentation, Dataflow job template creation is "currently limited to Java and Maven." However, the documentation for Java across GCP's Dataflow site is... messy, to say the least. The 1.x and 2.x versions of Dataflow are pretty far apart in terms of details, and I have some specific code requirements that lock me into the 2.0.0r3 codebase, so I'm pretty much required to use Apache Beam. Apache is, understandably, quite dedicated to Maven, but institutionally my company has thrown the bulk of its weight behind Gradle, so much so that it migrated all of its Java projects over last year and has pushed back against re-introducing Maven.

However, now we seem to be at an impasse: we have a specific goal of centralizing a lot of our back-end data gathering in GCP's Dataflow, and GCP Dataflow doesn't appear to have formal support for Gradle. If it does, it's not in the official documentation.

Is there a sufficient technical basis to build Dataflow templates with Gradle, with the issue simply being that Google's docs haven't been updated to reflect it? Is there a technical reason why Gradle can't do what's being done with Maven? Is there a better guide for working with GCP Dataflow than the docs on Google's and Apache's websites? I haven't worked with Maven archetypes before, and all the searches I've done for "gradle archetypes" turn up results that are, at best, over a year old. Most of the information points to forum discussions from 2014 and Gradle 1.7rc3, but we're on 3.5. This feels like it ought to be a solved problem, but for the life of me I can't find any current information on it online.

asked Apr 28 '17 by KristinaTracer

2 Answers

Commandline to Run Cloud Dataflow Job With Gradle

Generic Execution

$ gradle clean execute -DmainClass=com.foo.bar.myfolder.MyPipeline -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://my-bucket/tmpdataflow" -Pdataflow-runner

Specific Example

$ gradle clean execute -DmainClass=com.foo.bar.myfolder.MySpannerPipeline -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://my-bucket/tmpdataflow --spannerInstanceId=fooInstance --spannerDatabaseId=barDatabase" -Pdataflow-runner

Explanation of Commandline

  1. gradle clean execute uses the execute task, which allows us to easily pass command-line flags to the Dataflow pipeline. The clean task removes cached builds.

  2. -DmainClass= specifies the Java main class, since we have multiple pipelines in a single folder. Without this, Gradle doesn't know what the main class is or where to pass the args. Note: your build.gradle file must include the execute task shown below.

  3. -Dexec.args= specifies the execution arguments, which will be passed to the pipeline. Note: your build.gradle file must include the execute task shown below.

  4. --runner=DataflowRunner and -Pdataflow-runner ensure that the Google Cloud Dataflow runner is used and not the local DirectRunner.

  5. --spannerInstanceId= and --spannerDatabaseId= are just pipeline-specific flags; your pipeline won't have them, so substitute your own (see the sketch after this list for how such flags map to pipeline options).
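
For context, pipeline-specific flags like these are normally declared on a custom PipelineOptions interface, which Beam parses from the command-line arguments. A minimal sketch, assuming Beam 2.x; the interface and flag names are illustrative, not part of the original answer:

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Hypothetical options interface: getSpannerInstanceId() is populated from
// the --spannerInstanceId= flag passed via -Dexec.args.
public interface MySpannerOptions extends PipelineOptions {
    @Description("Cloud Spanner instance ID")
    String getSpannerInstanceId();
    void setSpannerInstanceId(String value);

    @Description("Cloud Spanner database ID")
    String getSpannerDatabaseId();
    void setSpannerDatabaseId(String value);
}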

build.gradle contents (NOTE: You need to populate your specific dependencies)

apply plugin: 'java'
apply plugin: 'maven'
apply plugin: 'application'

group = 'com.foo.bar'
version = '0.3'

mainClassName = System.getProperty("mainClass")

sourceCompatibility = 1.8
targetCompatibility = 1.8

repositories {
    maven { url "https://repository.apache.org/content/repositories/snapshots/" }
    // Maven Central now requires HTTPS
    maven { url "https://repo.maven.apache.org/maven2" }
}

dependencies {
    compile group: 'org.apache.beam', name: 'beam-sdks-java-core', version:'2.5.0'
    // Insert your build deps for your Beam Dataflow project here
    runtime group: 'org.apache.beam', name: 'beam-runners-direct-java', version:'2.5.0'
    runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version:'2.5.0'
}

task execute(type: JavaExec) {
    main = System.getProperty("mainClass")
    classpath = sourceSets.main.runtimeClasspath
    systemProperties System.getProperties()
    // Split the -Dexec.args string into individual program arguments;
    // default to an empty string so the task doesn't fail when it's absent
    args System.getProperty("exec.args", "").split()
}

Explanation of build.gradle

  1. We use the execute task (type: JavaExec) in order to easily pass runtime flags into the Java Dataflow pipeline program. For example, we can specify the main class (since we have more than one pipeline in the same folder) and we can pass specific Dataflow arguments (i.e., specific PipelineOptions); see the sketch after this list.

  2. The line of build.gradle that reads runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version:'2.5.0' is very important. It provides the Cloud Dataflow runner that allows you to execute pipelines in Google Cloud Platform.
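
To make the flow of arguments concrete, here is a minimal sketch of what such a main class might look like, reusing the hypothetical MySpannerOptions interface from above; the class name is illustrative:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MySpannerPipeline {
    public static void main(String[] args) {
        // The flags from -Dexec.args arrive here as ordinary program
        // arguments and are parsed into strongly typed pipeline options.
        MySpannerOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(MySpannerOptions.class);

        Pipeline pipeline = Pipeline.create(options);
        // ... apply your transforms here ...
        pipeline.run();
    }
}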

answered Oct 11 '22 by eb80

There's absolutely nothing stopping you from writing your Dataflow application/pipeline in Java and using Gradle to build it.

Gradle will simply produce an application distribution (e.g. ./gradlew clean distTar), which you then extract and run with the --runner=TemplatingDataflowPipelineRunner --dataflowJobFile=gs://... parameters.

It's just a runnable Java application.

The template and all the binaries will then be uploaded to GCS, and you can execute the pipeline through the console, CLI or even Cloud Functions.

You don't even need to use Gradle. You could just run it locally and the template/binaries will be uploaded. But I'd imagine you're using a build server like Jenkins.

Maybe the Dataflow docs should read "Note: Template creation is currently limited to Java", because this feature is not available in the Python SDK yet.
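
For reference, when creating templates with the Beam 2.x SDK, options that should be supplied at template execution time (rather than baked in at creation time) are declared as ValueProviders. A minimal sketch; the interface and option names here are illustrative:

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.ValueProvider;

// Hypothetical template options: ValueProvider defers resolution until the
// template is actually executed, instead of when the template is created.
public interface MyTemplateOptions extends PipelineOptions {
    @Description("Input file pattern, supplied when the template is run")
    ValueProvider<String> getInputFile();
    void setInputFile(ValueProvider<String> value);
}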

answered Oct 11 '22 by Graham Polley