How can I read a file as a stream from HDFS using Apache Spark with Java? I don't want to read the whole file; I want a file stream so that I can stop reading when some condition is met. How can I do this with Apache Spark?
Use readStream.format("socket") on the SparkSession object to read data from a socket, and provide the host and port options for where you want to stream data from.
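For example, here is a minimal sketch of the socket source, assuming Spark Structured Streaming and a test server listening on localhost:9999 (the host and port values are assumptions for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SocketStreamExample").getOrCreate()

// Read a stream of lines from the socket source (intended for testing only).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")   // assumed host
  .option("port", 9999)          // assumed port
  .load()

// Echo incoming lines to the console as they arrive.
val query = lines.writeStream.format("console").start()
query.awaitTermination()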
Apache Spark Streaming is a scalable, fault-tolerant stream processing system that natively supports both batch and streaming workloads.
Alternatively, you can stream files from HDFS using a StreamingContext (ssc) and its fileStream method:
// Requires org.apache.hadoop.io.{LongWritable, Text}, org.apache.hadoop.mapreduce.lib.input.TextInputFormat and org.apache.hadoop.fs.Path
val ssc = new StreamingContext(sparkConf, Seconds(batchTime))
val dStream = ssc.fileStream[LongWritable, Text, TextInputFormat](
  streamDirectory, (path: Path) => true, newFilesOnly = false)
The filter function parameter of this API lets you control which file paths are processed.
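For example, a sketch of a path filter that only processes files ending in .log (the extension is just an assumption for illustration):

val logStream = ssc.fileStream[LongWritable, Text, TextInputFormat](
  streamDirectory,
  (path: Path) => path.getName.endsWith(".log"),   // process only *.log files
  newFilesOnly = false)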
If your condition is based on the data rather than the file path/name, then you need to stop the streaming context once the condition is satisfied.
For this you can use a two-thread implementation:
1) One thread keeps checking whether the streaming context has stopped; if it has stopped, it notifies the other thread to wait and creates a new streaming context.
2) The second thread checks the condition and, when it is satisfied, stops the streaming context, as sketched below.
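Here is a minimal sketch of the stopping part, building on the fileStream above; the "STOP" marker used as the data-based condition and the 1-second polling interval are assumptions for illustration:

import java.util.concurrent.atomic.AtomicBoolean

// Driver-side flag shared between the streaming job and the monitor thread.
val conditionMet = new AtomicBoolean(false)

dStream.foreachRDD { rdd =>
  // Hypothetical data-based condition: a record containing "STOP" was seen.
  if (rdd.filter(_._2.toString.contains("STOP")).count() > 0) {
    conditionMet.set(true)
  }
}

// Monitor thread: stops streaming once the condition is met,
// keeping the underlying SparkContext alive.
new Thread(new Runnable {
  override def run(): Unit = {
    while (!conditionMet.get()) Thread.sleep(1000)
    ssc.stop(stopSparkContext = false, stopGracefully = true)
  }
}).start()

ssc.start()
ssc.awaitTermination()

Stopping the context from a separate thread avoids blocking inside the batch processing itself. Note that a stopped StreamingContext cannot be restarted; if you want to resume later (as in point 1), you must create a fresh StreamingContext.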
Please let me know if you need more explanation.