
Cannot use Apache Flink in Amazon EMR

I cannot start a YARN session of Apache Flink on Amazon EMR. The error message I get is:

$ tar xvzf flink-0.9.0-bin-hadoop26.tgz
$ cd flink-0.9.0
$ ./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096
...
Diagnostics: File file:/home/hadoop/.flink/application_1439466798234_0008/flink-conf.yaml does not exist
java.io.FileNotFoundException: File file:/home/hadoop/.flink/application_1439466798234_0008/flink-conf.yaml does not exist
...

I am using Flink version 0.9 and Amazon's Hadoop version 4.0.0. Any ideas or hints?

The full log can be found here: https://gist.github.com/headmyshoulder/48279f06c1850c62c28c

asked Aug 13 '15 15:08 by headmyshoulder




1 Answer

From the log:

The file system scheme is 'file'. This indicates that the specified Hadoop configuration path is wrong and the sytem is using the default Hadoop configuration values.The Flink YARN client needs to store its files in a distributed file system

Flink failed to read the Hadoop configuration files. They are either picked up from the environment variables, e.g. HADOOP_HOME, or you can set the configuration dir in the flink-conf.yaml before you execute your YARN command.

Flink needs to read the Hadoop configuration to know how to upload the Flink jar to the cluster file system such that the newly created YARN cluster can access it. If Flink fails to resolve the Hadoop configuration, it uses the local file system for uploading the jar. That means that the jar will be put on the machine you launch your cluster from. Thus, it won't be accessible from the Flink YARN cluster.
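To make the scheme check described above concrete, here is a minimal sketch in shell. The helper names fs_scheme and is_distributed_fs are hypothetical and are not part of Flink or Hadoop. They only classify a URI such as the file:/home/hadoop/.flink/... path from the error message, and treating a URI without a scheme as local is an assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Flink or Hadoop): classify a file
# system URI by its scheme. A "file" scheme means local disk, which the
# YARN containers on other nodes cannot read.
fs_scheme() {
    case "$1" in
        /*)  printf 'file\n' ;;            # plain absolute path: local
        *:*) printf '%s\n' "${1%%:*}" ;;   # text before the first colon
        *)   printf 'file\n' ;;            # assumption: no scheme means local
    esac
}

# Succeeds only when the URI points at something other than the local disk.
is_distributed_fs() {
    [ "$(fs_scheme "$1")" != "file" ]
}

# Possible usage on a cluster node (hdfs getconf ships with the Hadoop CLI):
#   default_fs=$(hdfs getconf -confKey fs.defaultFS)
#   is_distributed_fs "$default_fs" || echo "Hadoop config not picked up" >&2
```

If the default file system reported on the master node is not a local one while Flink still reports the 'file' scheme, that points to Flink not finding the Hadoop configuration rather than to the cluster itself.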

Please see the Flink configuration page for more information.

Edit: On Amazon EMR, running export HADOOP_CONF_DIR=/etc/hadoop/conf lets Flink discover the Hadoop configuration directory.
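Put together with the command from the question, the fix could look like the sketch below. The function find_hadoop_conf_dir is a hypothetical helper, not something shipped with Flink or EMR. It returns the first candidate directory that contains a core-site.xml, on the assumption that this file marks a usable Hadoop configuration directory.

```shell
#!/bin/sh
# Hypothetical helper: print the first candidate directory that contains
# core-site.xml, or fail if none does. Empty arguments are skipped.
find_hadoop_conf_dir() {
    for dir in "$@"; do
        if [ -n "$dir" ] && [ -f "$dir/core-site.xml" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Possible usage on the EMR master node, from the flink-0.9.0 directory:
#   HADOOP_CONF_DIR=$(find_hadoop_conf_dir "$HADOOP_CONF_DIR" /etc/hadoop/conf) || exit 1
#   export HADOOP_CONF_DIR
#   ./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096
```

Checking for the file before exporting the variable avoids silently falling back to the local file system when the directory is missing or empty.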

answered Sep 19 '22 21:09 by mxm