 

Apache Spark and Nifi Integration

I want to send a NiFi flowfile to Spark, do some transformations in Spark, and send the result back to NiFi so that I can do further operations in NiFi. I don't want the flowfile written to a database or HDFS first and then used to trigger a Spark job. I want to send the flowfile directly to Spark and receive the result directly from Spark in NiFi. I tried using the ExecuteSparkInteractive processor in NiFi but I am stuck. Any examples would be helpful.

Gowthaman V asked Oct 31 '18


People also ask

Does Apache NiFi use Spark?

Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. When paired with the CData JDBC Driver for Spark, NiFi can work with live Spark data. This article describes how to connect to and query Spark data from an Apache NiFi Flow.

What is NiFi and Spark?

NiFi offers highly configurable and secure data flow between systems. Other features include data provenance, efficient data buffering, flow-specific QoS, and parallel streaming capabilities. Spark, on the other hand, speeds up the computation process, regardless of the programming language used.

Is Apache NiFi an ETL tool?

Apache NiFi is an ETL tool with flow-based programming that comes with a web UI built to provide an easy way (drag & drop) to handle data flow in real-time. It also supports powerful and scalable means of data routing and transformation, which can be run on a single server or in a clustered mode across many servers.

What is Apache Spark integration?

Apache Spark is a unified analytics engine for large-scale data processing with built-in modules for SQL, streaming, machine learning, and graph processing. Spark can run on Apache Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, and can access diverse data sources.


2 Answers

You can't send data directly to Spark unless you use Spark Streaming. With traditional batch execution, Spark needs to read the data from some kind of storage, such as HDFS. The purpose of ExecuteSparkInteractive is to trigger a Spark job to run on data that has already been delivered to HDFS.

If you want to go the streaming route, there are two options...

1) Directly integrate NiFi with Spark Streaming

https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
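A minimal Scala sketch of that approach, using the NiFiReceiver from the nifi-spark-receiver module that the blog post above describes. The NiFi URL and the Output Port name ("Data For Spark") are assumptions; the port must exist as a Site-to-Site Output Port in your flow:

    import org.apache.nifi.remote.client.SiteToSiteClient
    import org.apache.nifi.spark.NiFiReceiver
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Site-to-Site config pointing at NiFi and an Output Port that feeds Spark.
    // Both the URL and the port name are assumptions for this sketch.
    val clientConfig = new SiteToSiteClient.Builder()
      .url("http://localhost:8080/nifi")
      .portName("Data For Spark")
      .buildConfig()

    val ssc = new StreamingContext(new SparkConf().setAppName("NiFiStreaming"), Seconds(10))

    // Each NiFiDataPacket carries a flowfile's content and its attributes.
    val packets = ssc.receiverStream(new NiFiReceiver(clientConfig, StorageLevel.MEMORY_ONLY))
    packets.map(p => new String(p.getContent)).print()

    ssc.start()
    ssc.awaitTermination()

Note that the receiver only pulls data out of NiFi; getting results back into NiFi still needs a return path, which is one reason the Kafka option below is often preferred.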

2) Use Kafka to integrate NiFi and Spark

NiFi writes to a Kafka topic, Spark reads from that topic, Spark writes its results back to another Kafka topic, and NiFi reads from that topic. This approach is probably the best option.
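A hedged sketch of the Spark side using Structured Streaming's Kafka source and sink. The broker address, the topic names (nifi-to-spark, spark-to-nifi), and the upper-casing transformation are all placeholders; on the NiFi side, PublishKafka would feed the first topic and ConsumeKafka would read the second:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("NiFiKafkaRoundTrip").getOrCreate()

    // Read the records that NiFi published to the assumed "nifi-to-spark" topic.
    val in = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "nifi-to-spark")
      .load()

    // Placeholder transformation: upper-case each message body.
    val out = in.selectExpr("upper(CAST(value AS STRING)) AS value")

    // Write results to the assumed "spark-to-nifi" topic for NiFi to consume.
    val query = out.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "spark-to-nifi")
      .option("checkpointLocation", "/tmp/nifi-spark-checkpoint")
      .start()

    query.awaitTermination()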

Bryan Bende answered Sep 30 '22

This might help:

You can do everything in NiFi by following the steps below:

  1. Use ListSFTP to list files from the landing location.
  2. Use an UpdateAttribute processor to assign the absolute file path to an attribute. You can reference this attribute in your Spark code, because the processor in the next step supports Expression Language.
  3. Use the ExecuteSparkInteractive processor. Here you can write Spark code (in Python, Scala, or Java) that reads the input file from the landing location (using the absolute-path attribute from step 2) without it flowing through NiFi as a flowfile, and performs operations/transformations on it (use spark.read... to load the file into a DataFrame). You may write your output either to a Hive external table or to a temporary HDFS location. See the sketch below.
  4. Use a FetchHDFS processor to read the file from the temporary HDFS location and continue with your further NiFi operations.

Here, you need a Livy setup to run Spark code from NiFi (through ExecuteSparkInteractive). You may look at how to set up Livy and the NiFi controller services needed to use Livy within NiFi.
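A rough sketch of what the Code property of ExecuteSparkInteractive could contain for step 3. In a Livy interactive session the SparkSession is already available as spark; the attribute name absolute.file.path, the CSV format, and the output path are all assumptions for illustration, and NiFi's Expression Language substitutes ${absolute.file.path} before the code is submitted:

    // Assumed attribute set by the UpdateAttribute processor in step 2;
    // NiFi resolves ${absolute.file.path} via Expression Language.
    val inputPath = "${absolute.file.path}"
    val outputPath = "/tmp/nifi_spark_output"  // assumed temp HDFS location for step 4

    // Read the landed file directly (assuming a CSV with a header row).
    val df = spark.read.option("header", "true").csv(inputPath)

    // Placeholder transformation: drop rows where every column is null.
    val result = df.na.drop("all")

    result.write.mode("overwrite").csv(outputPath)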

Good Luck!!

Ajay Ahuja answered Sep 30 '22