Write an RDD into HDFS in a Spark Streaming context

I have a Spark Streaming environment with Spark 1.2.0 where I retrieve data from a local folder, and every time a new file is added to the folder I perform some transformations.

val ssc = new StreamingContext(sc, Seconds(10))
val data = ssc.textFileStream(directory)

In order to perform my analysis on the DStream data, I have to transform it into an Array:

var arr = new ArrayBuffer[String]()
data.foreachRDD { rdd =>
  arr ++= rdd.collect()
}

Then I use the data obtained to extract the information I want and save it to HDFS:

val myRDD  = sc.parallelize(arr)
myRDD.saveAsTextFile("hdfs directory....")

Since I really need to manipulate the data as an Array, I can't save it to HDFS with DStream.saveAsTextFiles("...") (which would work fine) and have to save the RDD instead. But with this procedure I end up with empty output files named part-00000, etc.

With arr.foreach(println) I am able to see the correct results of the transformations.

My suspicion is that Spark tries at every batch to write data to the same files, deleting what was previously written. I tried saving to a dynamically named folder, like myRDD.saveAsTextFile("folder" + System.currentTimeMillis().toString()), but only one folder is ever created and the output files are still empty.

How can I write an RDD into HDFS in a spark-streaming context?

asked Jul 02 '15 by drstein

People also ask

Which function allows you to write an RDD to HDFS?

Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines. To write an RDD back to HDFS, use RDD.saveAsTextFile.
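
As a minimal sketch (the HDFS path and cluster address are hypothetical):

// Read a text file from HDFS into an RDD of lines; the path is a made-up example.
val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")
println(lines.count())  // triggers the read and prints the number of lines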

What is RDD in Spark Streaming?

The data pours in as a continuous stream of batches; this continuous stream of data is called a DStream. Every batch of the DStream contains a collection of elements that can be processed in parallel; this collection is called an RDD.
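
For example, a minimal sketch assuming an existing StreamingContext ssc and an input directory:

// Each batch of the DStream surfaces as one RDD that can be processed in parallel.
val stream = ssc.textFileStream(directory)
stream.foreachRDD { rdd =>
  println(s"batch contains ${rdd.count()} lines")  // rdd is an ordinary RDD for this batch
}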

How do I save an RDD as a text file?

saveAsTextFile. Save this RDD as a text file, using string representations of elements. Empty lines are tolerated when saving to text files.
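
For example (the output directory is hypothetical):

// Writes one part-xxxxx file per partition under the given directory.
val words = sc.parallelize(Seq("spark", "streaming", "hdfs"))
words.saveAsTextFile("hdfs://namenode:8020/user/me/words")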


1 Answer

You are using Spark Streaming in a way it wasn't designed to be used. I'd recommend either dropping Spark for your use case or adapting your code so it works the Spark way. Collecting the array to the driver defeats the purpose of using a distributed engine and makes your app effectively single-machine (two machines will also cause more overhead than just processing the data on a single machine).

Everything you can do with an array, you can do with Spark. So just run your computations inside the stream, distributed on the workers, and write your output using DStream.saveAsTextFiles(). You can use foreachRDD + saveAsParquet(path, overwrite = true) to write to a single file.
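
A minimal sketch of that approach, assuming the per-line processing can be expressed as an RDD transformation; extractInfo and the output path are hypothetical stand-ins for the question's own logic:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Keep the work on the executors instead of collecting to the driver.
val ssc = new StreamingContext(sc, Seconds(10))
val data = ssc.textFileStream(directory)

// extractInfo is a hypothetical stand-in for the question's transformation.
val results = data.map(line => extractInfo(line))

// Each batch is written under its own timestamped directory,
// e.g. .../output-1435838400000, so batches never overwrite each other.
results.saveAsTextFiles("hdfs://namenode:8020/user/me/output")

ssc.start()
ssc.awaitTermination()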

answered Sep 30 '22 by Marius Soutier