
Using a text file as Spark streaming source for testing purpose

I want to write a test for my Spark Streaming application that consumes a Flume source.

http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/ suggests using ManualClock, but for the moment, reading a file and verifying the outputs would be enough for me.

So I wish to use:

JavaStreamingContext streamingContext = ...
JavaDStream<String> stream = streamingContext.textFileStream(dataDirectory);
stream.print();
streamingContext.awaitTermination();
streamingContext.start();

Unfortunately it does not print anything.

I tried:

  • dataDirectory = "hdfs://node:port/absolute/path/on/hdfs/"
  • dataDirectory = "file://C:\\absolute\\path\\on\\windows\\"
  • adding the text file in the directory BEFORE the program begins
  • adding the text file in the directory WHILE the program runs

Nothing works.

Any suggestions for reading from a text file?

Thanks,

Martin



1 Answer

The order of start and awaitTermination is indeed inverted: start() must be called before awaitTermination(), since awaitTermination() blocks and the code after it is never reached.
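For illustration, here is a minimal, self-contained version of the question's snippet with the two calls in the right order. It is a sketch, not a drop-in fix: the master, app name, batch interval, and input directory are assumptions, and note that textFileStream only picks up files created in the directory after the stream has started.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class TextFileStreamTest {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("TextFileStreamTest");
        JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(1));

        // Placeholder directory: only files created here AFTER the stream starts are picked up.
        JavaDStream<String> stream = streamingContext.textFileStream("/tmp/stream-input");
        stream.print();

        // start() first; awaitTermination() blocks the current thread.
        streamingContext.start();
        streamingContext.awaitTermination();
    }
}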

In addition to that, the easiest way to pass data to your Spark Streaming application for testing is a QueueDStream. It's a mutable queue of RDDs of arbitrary data. This means you can create the data programmatically, or load it from disk into an RDD, and pass that to your Spark Streaming code.

E.g., to avoid the timing issues you are facing with the file-based source, you could try this:

import scala.collection.mutable.Queue

import org.apache.spark.rdd.RDD

// Load the test data from disk and enqueue it: each RDD in the queue
// is served as one batch of the resulting DStream.
val rdd: RDD[String] = sparkContext.textFile(...)
val rddQueue: Queue[RDD[String]] = Queue()
rddQueue += rdd

val dstream = streamingContext.queueStream(rddQueue)
doMyStuffWithDstream(dstream)

streamingContext.start()
streamingContext.awaitTermination()
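Since the question is in Java, here is a rough Java equivalent of the snippet above, using JavaStreamingContext.queueStream. The master, app name, batch interval, and input path are placeholders.

import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class QueueStreamTest {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("QueueStreamTest");
        JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Durations.seconds(1));

        // Each RDD pushed onto the queue is served as one batch of the DStream.
        Queue<JavaRDD<String>> rddQueue = new LinkedList<>();
        rddQueue.add(streamingContext.sparkContext().textFile("/path/to/test/data.txt"));

        JavaDStream<String> stream = streamingContext.queueStream(rddQueue);
        stream.print();

        streamingContext.start();
        streamingContext.awaitTermination();
    }
}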
maasg