When starting a Dataflow job (v2.4.0) from a jar with all dependencies included, the provided GCS path is not used. Instead, it seems that a gs:/ folder is created locally, and because of this the Dataflow workers try to access <localjarfolderpath>/gs:/... instead of the real GCS path gs://...
If I'm not mistaken, this was not the case with Dataflow 1.x.x.
Example command:
java -cp 0.1-1.0-SNAPSHOT-jar-with-dependencies.jar Main --stagingLocation=gs://test/staging/
Error on cloud console:
Staged package 0.1-1.0-SNAPSHOT-jar-with-dependencies-89nvLkMzfT53iBBXlpW_oA.jar at location <localjarfolderpath>/gs:/test/staging/ is inaccessible. ... The pattern must be of the form "gs://<bucket>/path/to/file".
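To illustrate where a path like <localjarfolderpath>/gs:/test/staging/ can come from, here is a minimal sketch in plain Java. It is not the Dataflow or Beam code; it only shows what happens on a Unix-like system when a gs:// string is treated as an ordinary local path: the double slash collapses and the result is resolved against the current directory. The class and method names are made up for this example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class GsPathDemo {
    // Treats the location as a plain local path, as code might do when it
    // does not recognise the "gs" scheme. Assumes a Unix-like file system.
    static String asLocalPath(String location) {
        Path p = Paths.get(location);
        // Redundant slashes are collapsed and the relative path is resolved
        // against the current working directory.
        return p.toAbsolutePath().toString();
    }

    public static void main(String[] args) {
        System.out.println(asLocalPath("gs://test/staging/"));
    }
}
```

The printed path ends in /gs:/test/staging, which matches the shape of the location in the error message.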
I managed to solve it by not using the maven-assembly-plugin to construct the jar with dependencies. When using the maven-dependency-plugin together with the maven-jar-plugin to package the jar and its dependencies, the staging path is constructed correctly and Dataflow successfully starts the job. For reference, here's my Maven jar build:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <executions>
    <execution>
      <id>copy-dependencies</id>
      <phase>prepare-package</phase>
      <goals>
        <goal>copy-dependencies</goal>
      </goals>
      <configuration>
        <outputDirectory>${project.build.directory}/lib</outputDirectory>
        <overWriteReleases>false</overWriteReleases>
        <overWriteSnapshots>false</overWriteSnapshots>
        <overWriteIfNewer>true</overWriteIfNewer>
      </configuration>
    </execution>
  </executions>
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-jar-plugin</artifactId>
  <configuration>
    <archive>
      <manifest>
        <addClasspath>true</addClasspath>
        <classpathPrefix>lib/</classpathPrefix>
        <mainClass>com.package.main</mainClass>
      </manifest>
    </archive>
  </configuration>
</plugin>
Since the jar's manifest contains a Class-Path entry, you can start the job with:
java -jar my-dataflow-job.jar
Note that the jar and the lib directory containing all dependencies must be in the same directory.
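If you want to check what the maven-jar-plugin actually wrote into the manifest, a small sketch like the following prints the Class-Path attribute of a jar. It uses only the JDK's java.util.jar classes; the class name ManifestCheck and the default file name are placeholders for this example.

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

class ManifestCheck {
    // Returns the Class-Path attribute of the jar's manifest,
    // or null if the jar has no manifest or no such attribute.
    static String classPathOf(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder jar name; pass the real path as the first argument.
        String jar = args.length > 0 ? args[0] : "my-dataflow-job.jar";
        System.out.println("Class-Path: " + classPathOf(jar));
    }
}
```

With the build above, the entries should all start with lib/, which is why the lib directory has to sit next to the jar.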
Update:
I noticed that the java -jar command does not always set the classpath correctly, even though it is defined in the manifest. If you have trouble using java -jar, the following command should work:
java -cp "my-dataflow-job.jar:lib/*" org.company.dataflow.Main
Update 2:
Together with @IvanPlantevin, I found out what the real problem is. What tipped us off was this post. The problem is the way the maven-assembly-plugin packages the jar. In the jar's service registrations (the files under META-INF/services), not all FileSystemRegistrars are included. In our case, the GcsFileSystemRegistrar was missing. We fixed the problem by using the maven-shade-plugin with the ServicesResourceTransformer. The solution below is the one that really addresses the problem; the solutions above are merely workarounds. This is our current build:
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.1.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <!-- NOTE! Don't forget the ServicesResourceTransformer, otherwise other file system registrars are not added to the jar! -->
              <transformer
                  implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
              <transformer
                  implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                <mainClass>com.package.main</mainClass>
              </transformer>
            </transformers>
            <shadedArtifactAttached>true</shadedArtifactAttached>
            <shadedClassifierName>runner</shadedClassifierName>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
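Finally, you can start it the regular way (note that with shadedArtifactAttached and the runner classifier, the shaded jar's file name ends in -runner.jar):
java -jar my-dataflow-job.jar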
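To verify that a packaged jar really registers the GCS file system, you can list the providers in its META-INF/services file. The sketch below does this with the JDK only. The service interface name org.apache.beam.sdk.io.FileSystemRegistrar is my assumption about the Beam SDK that Dataflow 2.x builds on, so check it against your SDK version; the class name ServiceEntries is a placeholder.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;
import java.util.zip.ZipEntry;

class ServiceEntries {
    // Returns the provider class names listed in
    // META-INF/services/<serviceInterface> inside the given jar.
    static List<String> providers(String jarPath, String serviceInterface) throws IOException {
        List<String> result = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            ZipEntry entry = jar.getEntry("META-INF/services/" + serviceInterface);
            if (entry == null) {
                return result; // no registration file for this interface
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(jar.getInputStream(entry), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    int hash = line.indexOf('#'); // strip comments
                    if (hash >= 0) {
                        line = line.substring(0, hash);
                    }
                    line = line.trim();
                    if (!line.isEmpty()) {
                        result.add(line);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder jar name; pass the real path as the first argument.
        String jar = args.length > 0 ? args[0] : "my-dataflow-job.jar";
        // Assumed interface name, see the note above.
        String service = "org.apache.beam.sdk.io.FileSystemRegistrar";
        for (String provider : providers(jar, service)) {
            System.out.println(provider);
        }
    }
}
```

If the output for the assembly-built jar lacks an entry ending in GcsFileSystemRegistrar while the shaded jar has one, that confirms the diagnosis above.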