I'm working on a recommender system using Apache Flink. The implementation runs when I test it in IntelliJ, but I would now like to run it on a cluster. I also built a jar file and tested it locally to see whether everything works, but I ran into a problem:
java.lang.NoClassDefFoundError: org/apache/flink/ml/common/FlinkMLTools$
As we can see, the class FlinkMLTools used in my code isn't found when the jar runs.
I built this jar with Maven 3.3.3 using mvn clean install, and I'm using Flink 0.9.0.
First Trial
My global project contains other projects, and this recommender is one of the sub-projects. Because of that, I have to run mvn clean install in the folder of the right project, otherwise Maven always builds a jar of another project (and I don't understand why). So I'm wondering whether there is a way to tell Maven explicitly to build one specific sub-project of the global project. Perhaps the path to FlinkMLTools is contained in a link present in the pom.xml file of the global project.
Any other ideas?
The problem is that Flink's binary distribution does not contain the libraries (flink-ml, gelly, etc.). This means that you either have to ship the library jar files with your job jar or that you have to copy them manually to your cluster. I strongly recommend the first option.
The easiest way to build a fat jar which does not contain unnecessary jars is to use Flink's quickstart archetype to set up the project's pom.
mvn archetype:generate -DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-scala -DarchetypeVersion=0.9.0
will create the structure for a Flink project using the Scala API. The generated pom file will have the following dependencies.
<dependencies>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala</artifactId>
    <version>0.9.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala</artifactId>
    <version>0.9.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>0.9.0</version>
  </dependency>
</dependencies>
You can remove flink-streaming-scala and insert the following dependency instead to include Flink's machine learning library.
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-ml</artifactId>
  <version>0.9.0</version>
</dependency>
When you now build the job jar with mvn package, the generated jar should contain the flink-ml jar and all of its transitive dependencies.
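To check whether the built jar really bundles the class named in the error, you can list the jar's entries and search for it. The following is a minimal shell sketch, not something that ships with Flink or Maven. The helper names class_to_entry and listing_has_class, and the jar path in the usage comment, are made up for illustration.

```shell
# Hypothetical helpers (not part of Flink or Maven).

# Convert a fully qualified class name into its entry path inside a jar,
# e.g. a.b.C -> a/b/C.class
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Read a jar listing on stdin; succeed if it contains the entry for class $1.
# -F makes grep treat the name literally, so the trailing '$' of a Scala
# object class is not interpreted as a regex anchor.
listing_has_class() {
  grep -qF "$(class_to_entry "$1")"
}

# Usage (the jar path is an assumption, adjust it to your artifact):
#   unzip -l target/your-job.jar \
#     | listing_has_class 'org.apache.flink.ml.common.FlinkMLTools$' \
#     && echo "class is bundled" || echo "class is missing"
```

If the check reports the class as missing for the jar produced by mvn package, the flink-ml classes were not packaged and the NoClassDefFoundError is to be expected on the cluster.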
If you want to copy the library jars to the cluster manually instead: Flink includes all jars located in the <FLINK_ROOT_DIR>/lib folder in the classpath of the executed jobs. Thus, in order to use Flink's machine learning library, you have to put the flink-ml jar and all needed transitive dependencies into that lib folder. This is rather tricky, since you have to figure out which transitive dependencies your algorithm actually needs, and consequently you will often end up copying all of them.
In order to build a specific sub-module X from your parent project you can use the following command:
mvn clean package -pl X -am
-pl allows you to specify which sub-modules you want to build, and -am tells Maven to also build other required sub-modules. It is also described here.
In cluster mode, Flink does not put all library JAR files into the classpath of its workers. When executing the program locally in IntelliJ all required dependencies are in the classpath, but not when executing on a cluster.
You have two options: copy the flink-ml jar into the lib folder of all Flink TaskManagers, or build a fat jar for your program that bundles flink-ml. See the Cluster Execution Documentation for details.