 

Apache Spark read file as a stream from HDFS

How can I read a file as a stream from HDFS using Apache Spark with Java? I don't want to read the whole file; I want a file stream so I can stop reading when some condition is met. How can I do that with Apache Spark?

Maksym asked Jan 28 '17 10:01


People also ask

How does Spark read streaming data?

Use readStream.format("socket") on the Spark session object to read data from a socket, providing the host and port options for the source you want to stream from.
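As a sketch, the socket source described above looks roughly like this in the Java API; the host, port, and local master are placeholder assumptions, and a listening socket must exist for it to actually run:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SocketStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("SocketStreamSketch")
                .master("local[2]")          // local master, for illustration only
                .getOrCreate();

        // Read a stream of text lines from a socket; host/port are placeholders
        Dataset<Row> lines = spark.readStream()
                .format("socket")
                .option("host", "localhost")
                .option("port", 9999)
                .load();

        // Echo each micro-batch to the console until the query is stopped
        lines.writeStream()
                .format("console")
                .start()
                .awaitTermination();
    }
}
```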

Does Apache spark support stream processing?

Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads.


1 Answer

You can stream files from an HDFS directory using the StreamingContext fileStream method:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sparkConf, Seconds(batchTime))

val dStream = ssc.fileStream[LongWritable, Text, TextInputFormat](
  streamDirectory, (x: Path) => true, newFilesOnly = false)

The filter parameter of this API (the `(x: Path) => true` above) lets you choose which paths get processed.
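Since the question asks for Java, here is roughly the same call through JavaStreamingContext; the directory name and the `.txt` path filter are illustrative assumptions, and running it requires a Spark cluster or local mode plus HDFS access:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class HdfsFileStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("HdfsFileStreamSketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Illustrative filter: only process paths ending in ".txt"
        Function<Path, Boolean> filter = path -> path.getName().endsWith(".txt");

        JavaPairInputDStream<LongWritable, Text> stream = jssc.fileStream(
                "hdfs:///data/incoming",   // directory to watch (placeholder)
                LongWritable.class, Text.class, TextInputFormat.class,
                filter,
                false);                    // newFilesOnly = false: also pick up existing files

        stream.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```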

If your condition is based on the data rather than on the file path/name, then you need to stop the streaming context once the condition is satisfied.

For this you need a two-thread implementation: 1) one thread keeps checking whether the streaming context has stopped; once it has, it notifies the waiting code and creates a new streaming context.

2) The second thread checks the condition against the data and stops the streaming context when the condition is satisfied.
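The two-thread idea above can be sketched without Spark at all. In this hedged sketch an AtomicBoolean stands in for the streaming context's stopped state, a monitor thread watches the "data" and stops the "context" when the condition holds, and the main thread waits for that stop; all names, the record count, and the condition are illustrative:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class StopConditionSketch {
    // Stand-in for the streaming context's stopped state (would be ssc.stop() in Spark)
    static final AtomicBoolean contextStopped = new AtomicBoolean(false);
    static final AtomicInteger recordsSeen = new AtomicInteger(0);

    public static void run() throws InterruptedException {
        CountDownLatch stopped = new CountDownLatch(1);

        // Thread 2: watch the data and stop the "context" when the condition holds
        Thread monitor = new Thread(() -> {
            while (recordsSeen.get() < 5) {   // illustrative stop condition
                Thread.onSpinWait();
            }
            contextStopped.set(true);         // would be ssc.stop() in Spark
            stopped.countDown();              // notify the waiting thread
        });
        monitor.start();

        // Simulated stream: the condition is met after 5 records arrive
        for (int i = 0; i < 5; i++) {
            recordsSeen.incrementAndGet();
        }

        // Thread 1: wait until the "context" is stopped; a new one would be created here
        stopped.await();
        monitor.join();
        System.out.println("context stopped after " + recordsSeen.get() + " records");
    }

    public static void main(String[] args) throws InterruptedException {
        run();
    }
}
```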

Please let me know if you need further explanation.

Hutashan Chandrakar answered Oct 03 '22 14:10