 

Workaround for Scala RDD not being covariant

I'm trying to write a function to operate on RDD[Seq[String]] objects, e.g.:

def foo(rdd: RDD[Seq[String]]) = { println("hi") }

This function cannot be called on objects of type RDD[Array[String]]:

val testRdd : RDD[Array[String]] = sc.textFile("somefile").map(_.split("\\|", -1))
foo(testRdd)

->
error: type mismatch;
found   : org.apache.spark.rdd.RDD[Array[String]]
required: org.apache.spark.rdd.RDD[Seq[String]]

I guess that's because RDD isn't covariant.

I've tried a bunch of definitions of foo to get around this; only one of them compiles:

def foo2[T[String] <: Seq[String]](rdd: RDD[T[String]]) = { println("hi") }

But it's still broken:

foo2(testRdd)


->
<console>:101: error: inferred type arguments [Array] do not conform to method foo2's type
parameter bounds [T[String] <: Seq[String]]
          foo2(testRdd)
          ^
<console>:101: error: type mismatch;
found   : org.apache.spark.rdd.RDD[Array[String]]
required: org.apache.spark.rdd.RDD[T[String]]

Any idea how I can work around this? This is all taking place in the Spark shell.

asked May 22 '14 by user3666020



1 Answer

For this you can use a view bound.

Array is not a Seq, but it can be viewed as a Seq.

def foo[T <% Seq[String]](rdd: RDD[T]) = ???

The <% says that T can be viewed as a Seq[String]: whenever you call a Seq[String] method on a value of type T, that value is first converted to a Seq[String].

For an Array[A] to be viewed as a Seq[A], there needs to be an implicit conversion in scope from Array to Seq. As Ionuț G. Stan noted, one exists in scala.Predef.
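
A minimal sketch of how this plays out in the Spark shell, reusing the question's split over "somefile". The function body is just the question's println, and the implicit-parameter form shown in the comment is an equivalent alternative rather than something from the original answer:

import org.apache.spark.rdd.RDD

// View bound: accept any T that can be implicitly viewed as a Seq[String].
// (The desugared equivalent is an implicit conversion parameter:
//   def foo[T](rdd: RDD[T])(implicit ev: T => Seq[String]) = ...)
def foo[T <% Seq[String]](rdd: RDD[T]): Unit = { println("hi") }

val seqRdd: RDD[Seq[String]] = sc.textFile("somefile").map(_.split("\\|", -1).toSeq)
val arrayRdd: RDD[Array[String]] = sc.textFile("somefile").map(_.split("\\|", -1))

foo(seqRdd)   // compiles: Seq[String] is trivially a Seq[String]
foo(arrayRdd) // compiles: Array[String] can be viewed as Seq[String] via scala.Predef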

answered by ggovan