 

How to store and read data from Spark PairRDD

Tags:

apache-spark

A Spark PairRDD has the option to save its contents to a file:

JavaRDD<String> baseRDD = context.parallelize(Arrays.asList("This", "is", "dummy", "data"));

JavaPairRDD<String, Integer> myPairRDD =
    baseRDD.mapToPair(new PairFunction<String, String, Integer>() {

      @Override
      public Tuple2<String, Integer> call(String input) throws Exception {
        // map each word to a (word, length) pair
        return new Tuple2<String, Integer>(input, input.length());
      }
    });

myPairRDD.saveAsTextFile("path");

SparkContext's textFile reads the data back only as a JavaRDD<String>.

How can the JavaPairRDD be reconstructed directly from the saved output?

Notes:

  • One possible approach is to read the data into a JavaRDD<String> and construct the JavaPairRDD from it, but with huge data this takes a considerable amount of resources.

  • Storing the intermediate file in a non-text format is also fine.

  • Execution environment: JRE 1.7
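For reference, the approach in the first note amounts to parsing each line of the saveAsTextFile output back into a pair: each line is a Tuple2 rendered via its toString(), e.g. (This,4). A minimal, Spark-free sketch of that parsing step (in Spark it would live inside a mapToPair function; java.util's SimpleEntry stands in for Tuple2 here, and the class name PairLineParser is hypothetical):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class PairLineParser {

  // Parse one line of saveAsTextFile output, e.g. "(dummy,5)",
  // back into a (word, length) pair. Splitting on the LAST comma
  // keeps keys that themselves contain commas intact.
  public static Map.Entry<String, Integer> parse(String line) {
    String body = line.substring(1, line.length() - 1); // drop "(" and ")"
    int comma = body.lastIndexOf(',');
    return new SimpleEntry<String, Integer>(
        body.substring(0, comma),
        Integer.valueOf(body.substring(comma + 1)));
  }

  public static void main(String[] args) {
    Map.Entry<String, Integer> pair = parse("(dummy,5)");
    System.out.println(pair.getKey() + " -> " + pair.getValue()); // dummy -> 5
  }
}
```

This is exactly the per-record work the question wants to avoid at scale, which is why a binary format (see the accepted answer) is preferable.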

asked Feb 11 '23 by Vijay Innamuri

1 Answer

You can save them as an object file if you don't mind the result not being human-readable.

Save the file:

myPairRDD.saveAsObjectFile(path);

and then read the pairs back like this:

JavaPairRDD.fromJavaRDD(sc.objectFile(path))

EDIT:

A working example:

// Written with an anonymous class rather than a lambda,
// since the question's environment is JRE 1.7.
JavaRDD<String> rdd = sc.parallelize(Arrays.asList("1", "2"));
rdd.mapToPair(new PairFunction<String, String, String>() {
  @Override
  public Tuple2<String, String> call(String p) throws Exception {
    return new Tuple2<String, String>(p, p);
  }
}).saveAsObjectFile("c://example");

// Read the pairs back; the explicit type witness helps Java 7's inference.
JavaPairRDD<String, String> pairRDD =
    JavaPairRDD.fromJavaRDD(sc.<Tuple2<String, String>>objectFile("c://example"));
for (Tuple2<String, String> pair : pairRDD.collect()) {
  System.out.println(pair);
}
answered Feb 12 '23 by abalcerek