How to solve SPARK-5063 in nested map functions

RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

As the error says, I'm trying to map (a transformation) over a JavaRDD object inside the main map function. How can this be done with Apache Spark?

The main JavaPairRDD object (TextFile and Word are defined classes):

JavaPairRDD<TextFile, JavaRDD<Word>> filesWithWords = new...

and map function:

filesWithWords.map(textFileJavaRDDTuple2 ->
    textFileJavaRDDTuple2._2().map(word ->
        new Word(word.getText(),
            (long) textFileJavaRDDTuple2._1().getText().split(word.getText()).length)));

I also tried foreach instead of map, but it doesn't work either. (And of course I searched for SPARK-5063.)

asked May 01 '15 by Alper M.


2 Answers

In the same way that nested operations on RDDs are not supported, nested RDD types are not possible in Spark. RDDs are defined only at the driver, where, in combination with their SparkContext, they can schedule operations on the data they represent.

So, the root cause we need to address in this case is the datatype:

JavaPairRDD<TextFile, JavaRDD<Word>> filesWithWords

This type has no possible valid use in Spark. Depending on the use case, which is not further explained in the question, it should become one of the following:

A collection of RDDs, keyed by the text file they refer to:

Map<TextFile,RDD<Word>>

Or a flat collection of (TextFile, Word) pairs:

JavaPairRDD<TextFile, Word>

Or the words grouped by their corresponding TextFile:

JavaPairRDD<TextFile, List<Word>>

Once the type is corrected, the issues with the nested RDD operations are naturally resolved.
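
For instance, here is a minimal sketch of building the flattened JavaPairRDD<TextFile, Word> variant with a single flatMapToPair, so no RDD is ever nested inside another. It assumes Spark 2.x (where flatMapToPair expects an Iterator), a starting JavaRDD<TextFile>, serializable TextFile and Word classes as in the question, and a hypothetical whitespace tokenization:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// Sketch: flatten each TextFile into (TextFile, Word) pairs so a single,
// non-nested RDD carries all the data. TextFile and Word must be Serializable.
static JavaPairRDD<TextFile, Word> wordsByFile(JavaRDD<TextFile> textFiles) {
    return textFiles.flatMapToPair(textFile -> {
        List<Tuple2<TextFile, Word>> pairs = new ArrayList<>();
        // Whitespace split stands in for whatever tokenization applies here.
        for (String token : textFile.getText().split("\\s+")) {
            pairs.add(new Tuple2<>(textFile, new Word(token, 1L)));
        }
        return pairs.iterator(); // Spark 2.x flatMapToPair returns an Iterator
    });
}

From there, ordinary pair operations such as reduceByKey or countByKey recover per-file statistics without invoking any RDD operation inside another.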

answered Oct 11 '22 by maasg


When I got to this exact same point in my learning curve for Spark (I tried and failed to use nested RDDs), I switched to DataFrames and was able to accomplish the same thing using joins instead. Also, in general, DataFrames appear to be almost twice as fast as RDDs, at least for the work I have been doing.
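
As a rough sketch of that approach (hypothetical schemas; the SparkSession setup, JSON sources, and the fileId and word column names are assumptions, not my actual code), a join attaches each word to its file and a groupBy aggregates per file, with no nesting involved:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("wordsByFile").getOrCreate();

// Assumed inputs: files(fileId, text) and words(fileId, word).
Dataset<Row> files = spark.read().json("files.json");
Dataset<Row> words = spark.read().json("words.json");

// The join replaces the nested-RDD lookup: pair every word with its file,
// then aggregate occurrences per (fileId, word).
Dataset<Row> wordCounts = words
    .join(files, "fileId")
    .groupBy(col("fileId"), col("word"))
    .agg(count("*").alias("occurrences"));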

answered Oct 11 '22 by David Griffin