
Invalid GCS URI used for staging location

When starting a Dataflow job (SDK v2.4.0) from a jar with all dependencies included, the provided GCS path is not used. Instead, a gs:/ folder seems to be created locally, and because of this the Dataflow workers try to access <localjarfolderpath>/gs:/... rather than the real GCS path gs://... If I'm correct, this was not the case with Dataflow 1.x.x.

Example command:

java -cp 0.1-1.0-SNAPSHOT-jar-with-dependencies.jar Main --stagingLocation=gs://test/staging/

Error on cloud console:

Staged package 0.1-1.0-SNAPSHOT-jar-with-dependencies-89nvLkMzfT53iBBXlpW_oA.jar at location <localjarfolderpath>/gs:/test/staging/ is inaccessible. ... The pattern must be of the form "gs://<bucket>/path/to/file".
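
The shape of the reported path is consistent with the staging location being handled as an ordinary local file path instead of a GCS URI. The following is a minimal sketch of that effect using only java.nio, not Dataflow's own code. The class name, the method name and the working directory /home/user/jars are hypothetical, and a Unix-like file system is assumed (on Windows the colon would be rejected as an invalid path character).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalPathFallbackSketch {

    // Interprets the location as a plain path relative to workingDir,
    // which is what happens when nothing recognises the gs:// scheme.
    static String asLocalPath(String location, Path workingDir) {
        // A Unix path collapses "//" into "/" and drops the trailing slash.
        return workingDir.resolve(location).normalize().toString();
    }

    public static void main(String[] args) {
        // Hypothetical directory standing in for <localjarfolderpath>.
        Path workingDir = Paths.get("/home/user/jars");
        System.out.println(asLocalPath("gs://test/staging/", workingDir));
    }
}
```

On Linux this yields /home/user/jars/gs:/test/staging, which has the same <localjarfolderpath>/gs:/... form as the path in the error message.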

bjorndv asked Apr 04 '18 15:04


1 Answer

I managed to solve it by not using the maven-assembly-plugin to build the jar with dependencies. When the jar is built with the maven-dependency-plugin together with the maven-jar-plugin, the staging path is constructed correctly and Dataflow starts the job successfully. For reference, here's my Maven jar build:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-dependency-plugin</artifactId>
    <executions>
        <execution>
            <id>copy-dependencies</id>
            <phase>prepare-package</phase>
            <goals>
                <goal>copy-dependencies</goal>
            </goals>
            <configuration>
                <outputDirectory>${project.build.directory}/lib</outputDirectory>
                <overWriteReleases>false</overWriteReleases>
                <overWriteSnapshots>false</overWriteSnapshots>
                <overWriteIfNewer>true</overWriteIfNewer>
            </configuration>
        </execution>
    </executions>
</plugin>
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <configuration>
        <archive>
            <manifest>
                <addClasspath>true</addClasspath>
                <classpathPrefix>lib/</classpathPrefix>
                <mainClass>org.company.dataflow.Main</mainClass>
            </manifest>
        </archive>
    </configuration>
</plugin>

Since the jar's manifest contains a Class-Path entry, you can start the job with:

java -jar my-dataflow-job.jar

Note that the jar and the lib directory containing all dependencies must be in the same directory.
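
The same-directory requirement follows from how Class-Path entries in a manifest are resolved: relative to the location of the jar that declares them. Here is a minimal sketch of that resolution using java.util.jar.Manifest. The manifest text, the dependency names dep-a.jar and dep-b.jar, and the directory /opt/jobs are made up for illustration. A real manifest generated by the maven-jar-plugin lists your actual dependencies.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class ManifestClassPathSketch {

    // Parses manifest text held in memory (every line must end with a newline).
    static Manifest parse(String text) {
        try {
            return new Manifest(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Resolves each Class-Path entry against the directory that holds the jar.
    static List<Path> resolveClassPath(Manifest manifest, Path jarFile) {
        List<Path> resolved = new ArrayList<>();
        String classPath = manifest.getMainAttributes().getValue(Attributes.Name.CLASS_PATH);
        if (classPath == null || classPath.trim().isEmpty()) {
            return resolved;
        }
        Path jarDir = jarFile.toAbsolutePath().getParent();
        for (String entry : classPath.trim().split("\\s+")) {
            resolved.add(jarDir.resolve(entry).normalize());
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Hypothetical manifest, shaped like the maven-jar-plugin output above.
        Manifest manifest = parse("Manifest-Version: 1.0\n"
                + "Main-Class: org.company.dataflow.Main\n"
                + "Class-Path: lib/dep-a.jar lib/dep-b.jar\n");
        for (Path p : resolveClassPath(manifest, Paths.get("/opt/jobs/my-dataflow-job.jar"))) {
            System.out.println(p);
        }
    }
}
```

If the jar is moved without its lib directory, the resolved paths point at files that no longer exist, and the dependencies silently drop off the classpath.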

Update: I noticed that the java -jar command does not always set the classpath correctly, even though it is defined in the manifest. If you have trouble with java -jar, the following command should work (on Windows, use ; instead of : as the classpath separator):

java -cp "my-dataflow-job.jar:lib/*" org.company.dataflow.Main

Update 2: Together with @IvanPlantevin, I found out what the real problem is. What tipped us off was this post. The problem is the way the maven-assembly-plugin packages the jar: in the jar's META-INF/services entries, not all FileSystemRegistrars are included. In our case, the GcsFileSystemRegistrar was missing. We fixed the problem by using the maven-shade-plugin with the ServicesResourceTransformer. The solution below is the one that really addresses the problem. The solutions above are merely workarounds. This is our current build:

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <transformers>
                            <!-- NOTE! Don't forget the ServicesResourceTransformer, otherwise other file system registrars are not added to the jar! -->
                            <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                            <transformer
                                    implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>org.company.dataflow.Main</mainClass>
                            </transformer>
                        </transformers>
                        <shadedArtifactAttached>true</shadedArtifactAttached>
                        <shadedClassifierName>runner</shadedClassifierName>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
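
To make the role of the ServicesResourceTransformer more concrete, here is a minimal sketch of the difference between keeping a single copy of a same-named META-INF/services file and merging all copies. It is not the plugin's actual code. The fully qualified registrar names are assumptions based on the Beam SDK package layout, and which copy survives without merging is also an assumption. The point is only that one copy survives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ServiceFileMergeSketch {

    // Parses one service file: one provider class per line, '#' starts a comment.
    static List<String> providers(String content) {
        List<String> result = new ArrayList<>();
        for (String line : content.split("\n")) {
            int hash = line.indexOf('#');
            String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!name.isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    // Without merging: only one copy of the same-named file ends up in the jar
    // (the first one here, as an assumption).
    static List<String> keepSingleCopy(List<String> serviceFiles) {
        return providers(serviceFiles.get(0));
    }

    // With merging: provider lines of all copies are combined, duplicates dropped.
    static List<String> mergeAllCopies(List<String> serviceFiles) {
        Set<String> all = new LinkedHashSet<>();
        for (String content : serviceFiles) {
            all.addAll(providers(content));
        }
        return new ArrayList<>(all);
    }

    public static void main(String[] args) {
        // Two dependencies each ship a service file with the same name,
        // listing their own registrar (class names are assumed).
        List<String> copies = Arrays.asList(
                "org.apache.beam.sdk.io.LocalFileSystemRegistrar\n",
                "org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystemRegistrar\n");
        System.out.println("single copy: " + keepSingleCopy(copies));
        System.out.println("merged:      " + mergeAllCopies(copies));
    }
}
```

In the single-copy case the GCS registrar is absent from the resulting list, which matches the missing GcsFileSystemRegistrar described above.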

Finally, you can start it the regular way: java -jar my-dataflow-job.jar

Robin Trietsch answered Jan 02 '23 21:01