
mapPartitions returns empty array

I have the following RDD, which has 4 partitions:

val rdd=sc.parallelize(1 to 20,4)

Now I call mapPartitions on it:

scala> rdd.mapPartitions(x=> { println(x.size); x }).collect
5
5
5
5
res98: Array[Int] = Array()

Why does it return an empty array? The anonymous function simply returns the same iterator it received, so how can the result be empty? Interestingly, if I remove the println statement, it does return a non-empty array:

scala> rdd.mapPartitions(x=> { x }).collect
res101: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

I don't understand this. How can the presence of println (which simply prints the size of the iterator) affect the final result of the function?

Dhiraj asked Aug 17 '15 02:08


1 Answer

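That's because x is an Iterator (a TraversableOnce), which can only be traversed once. Calling size walks the iterator to the end to count its elements, so by the time you return x it is already exhausted, and collect gets nothing from any partition.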
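
The same behaviour can be reproduced without Spark. This is a minimal sketch in plain Scala; the names it, n and leftover are only for illustration:

```scala
// A plain Scala iterator, standing in for one partition's data
val it = Iterator(1, 2, 3, 4, 5)

// size has to walk the iterator to the end in order to count
val n = it.size

// Nothing is left to read afterwards
val leftover = it.toList

println(s"size = $n, leftover = $leftover")
```

The println inside mapPartitions does exactly this to each partition before the iterator is handed back.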

You could work around it a number of ways, but here is one:

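rdd.mapPartitions(x => {
  // Materialize the partition first: an Iterator can only be traversed once
  val list = x.toList
  println(list.size)
  // Return a fresh iterator over the buffered elements
  list.iterator
}).collect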
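
Note that toList holds the whole partition in memory. If that is a concern, another possible workaround is to count the elements as they pass through instead of up front. The sketch below is only an illustration: countingIterator is a hypothetical helper written for this answer, not part of Spark or the Scala library.

```scala
// Hypothetical helper (not a Spark API): wraps an iterator, counts elements
// as they are consumed, and reports the total once the source runs dry.
def countingIterator[T](it: Iterator[T])(onDone: Int => Unit): Iterator[T] =
  new Iterator[T] {
    private var count = 0
    private var reported = false

    def hasNext: Boolean = {
      val more = it.hasNext
      // Report exactly once, the first time the source is found exhausted
      if (!more && !reported) {
        reported = true
        onDone(count)
      }
      more
    }

    def next(): T = {
      val value = it.next()
      count += 1
      value
    }
  }
```

Assuming the same rdd as in the question, it could be used as rdd.mapPartitions(x => countingIterator(x)(n => println(n))).collect. The sizes are then printed only after each partition has been fully consumed, and on a cluster the output goes to the executors' stdout rather than the driver console.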
Justin Pihony answered Nov 05 '22 11:11