Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark streaming DStream RDD to get file name

Spark streaming textFileStream and fileStream can monitor a directory and process the new files in a Dstream RDD.

How to get the file names that are being processed by the DStream RDD at that particular interval?

like image 790
Vijay Innamuri Avatar asked Mar 13 '15 11:03

Vijay Innamuri


People also ask

What is DStream in Spark Streaming?

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream.

What is difference between DStream and structured Streaming?

Internally, a DStream is a sequence of RDDs. Spark receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the SparkSQL API for data stream processing.

Can we perform basic operations of RDD in DStream?

We can use these operations on DStreams to perform underlying RDD operations on each batch. If stateless transformations are insufficient, DStreams comes with an advanced operator called transform(). transform() allow operating on the RDDs inside them.

What is DStream and RDD?

A DStream is logically a series of RDD s. Spark Streaming is just to hide the process of creating Seq[RDD] so it is not your job but the framework. Moreover, Spark Streaming gives you a much nicer developer API so you can think of Seq[RDD] as a DStream , but rather than rdds.


2 Answers

fileStream produces UnionRDD of NewHadoopRDDs. The good part about NewHadoopRDDs created by sc.newAPIHadoopFile is that their names are set to their paths.

Here's the example of what you can do with that knowledge:

def namedTextFileStream(ssc: StreamingContext, directory: String): DStream[String] =
  ssc.fileStream[LongWritable, Text, TextInputFormat](directory)
    .transform( rdd =>
      new UnionRDD(rdd.context,
        rdd.dependencies.map( dep =>
          dep.rdd.asInstanceOf[RDD[(LongWritable, Text)]].map(_._2.toString).setName(dep.rdd.name)
        )
      )
    )

def transformByFile[U: ClassTag](unionrdd: RDD[String],
                                 transformFunc: String => RDD[String] => RDD[U]): RDD[U] = {
  new UnionRDD(unionrdd.context,
    unionrdd.dependencies.map{ dep =>
      if (dep.rdd.isEmpty) None
      else {
        val filename = dep.rdd.name
        Some(
          transformFunc(filename)(dep.rdd.asInstanceOf[RDD[String]])
            .setName(filename)
        )
      }
    }.flatten
  )
}

def main(args: Array[String]) = {
  val conf = new SparkConf()
    .setAppName("Process by file")
    .setMaster("local[2]")

  val ssc = new StreamingContext(conf, Seconds(30))

  val dstream = namesTextFileStream(ssc, "/some/directory")

  def byFileTransformer(filename: String)(rdd: RDD[String]): RDD[(String, String)] =
    rdd.map(line => (filename, line))

  val transformed = dstream.
    transform(rdd => transformByFile(rdd, byFileTransformer))

  // Do some stuff with transformed

  ssc.start()
  ssc.awaitTermination()
}
like image 67
nonsleepr Avatar answered Oct 04 '22 18:10

nonsleepr


For those that want some Java code instead of Scala:

JavaPairInputDStream<LongWritable, Text> textFileStream = 
        jsc.fileStream(
            inputPath, 
            LongWritable.class, 
            Text.class,
            TextInputFormat.class, 
            FileInputDStream::defaultFilter,
            false
        );
JavaDStream<Tuple2<String, String>> namedTextFileStream = textFileStream.transform((pairRdd, time) -> {
        UnionRDD<Tuple2<LongWritable, Text>> rdd = (UnionRDD<Tuple2<LongWritable, Text>>) pairRdd.rdd();
        List<RDD<Tuple2<LongWritable, Text>>> deps = JavaConverters.seqAsJavaListConverter(rdd.rdds()).asJava();
        List<RDD<Tuple2<String, String>>> collectedRdds = deps.stream().map( depRdd -> {
            if (depRdd.isEmpty()) {
                return null;
            }
            JavaRDD<Tuple2<LongWritable, Text>> depJavaRdd = depRdd.toJavaRDD();
            String filename = depRdd.name();
            JavaPairRDD<String, String> newDep = JavaPairRDD.fromJavaRDD(depJavaRdd).mapToPair(t -> new Tuple2<String, String>(filename, t._2().toString())).setName(filename);
            return newDep.rdd();
        }).filter(t -> t != null).collect(Collectors.toList());
        Seq<RDD<Tuple2<String, String>>> rddSeq = JavaConverters.asScalaBufferConverter(collectedRdds).asScala().toIndexedSeq();
        ClassTag<Tuple2<String, String>> classTag = scala.reflect.ClassTag$.MODULE$.apply(Tuple2.class);
        return new UnionRDD<Tuple2<String, String>>(rdd.sparkContext(), rddSeq, classTag).toJavaRDD();
});
like image 20
racc Avatar answered Oct 04 '22 20:10

racc