I'm trying to write a function to operate on RDD[Seq[String]] objects, e.g.:
def foo(rdd: RDD[Seq[String]]) = { println("hi") }
This function cannot be called on objects of type RDD[Array[String]]:
val testRdd : RDD[Array[String]] = sc.textFile("somefile").map(_.split("\\|", -1))
foo(testRdd)
->
error: type mismatch;
found : org.apache.spark.rdd.RDD[Array[String]]
required: org.apache.spark.rdd.RDD[Seq[String]]
I guess that's because RDD isn't covariant.
I've tried a bunch of definitions of foo to get around this. Only one of them has compiled:
def foo2[T[String] <: Seq[String]](rdd: RDD[T[String]]) = { println("hi") }
But it's still broken:
foo2(testRdd)
->
<console>:101: error: inferred type arguments [Array] do not conform to method foo2's type
parameter bounds [T[String] <: Seq[String]]
foo2(testRdd)
^
<console>:101: error: type mismatch;
found : org.apache.spark.rdd.RDD[Array[String]]
required: org.apache.spark.rdd.RDD[T[String]]
Any idea how I can work around this? This is all taking place in the Spark shell.
The coalesce() and repartition() transformations both change the number of partitions in an RDD. The main difference is that repartition() always performs a full shuffle, so use it when increasing the number of partitions, whereas coalesce() can reduce the number of partitions without a full shuffle.
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
The count() action returns the number of elements in an RDD. For example, if an RDD holds the values {1, 2, 2, 3, 4, 5, 5, 6}, rdd.count() returns 8.
A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark: an immutable, distributed collection of objects of any type. As the name suggests, it is a resilient (fault-tolerant) collection of records that resides on multiple nodes.
For this you can use a view bound. Array is not a Seq, but it can be viewed as a Seq.
def foo[T <% Seq[String]](rdd: RDD[T]) = ???
The <% says that T can be viewed as a Seq[String], so whenever you use a Seq[String] method on T, T will be converted to Seq[String].
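To make the mechanism concrete without needing a Spark context, here is a minimal sketch that assumes Scala 2.12 (newer versions may emit deprecation warnings for the Array conversion). `Box` is a hypothetical stand-in for RDD: it is invariant in its type parameter, as RDD is. `fieldCounts` is likewise a made-up name. The method is written in the desugared form of a view bound: `[T <% Seq[String]]` is shorthand for an extra implicit parameter of type `T => Seq[String]`.

```scala
// Hypothetical stand-in for RDD: invariant in A, like org.apache.spark.rdd.RDD.
final case class Box[A](items: List[A]) {
  def map[B](f: A => B): Box[B] = Box(items.map(f))
}

object ViewBoundSketch {
  // Equivalent to: def fieldCounts[T <% Seq[String]](box: Box[T]): Box[Int]
  // The implicit parameter is the "view" from T to Seq[String].
  def fieldCounts[T](box: Box[T])(implicit asSeq: T => Seq[String]): Box[Int] =
    box.map(row => asSeq(row).length)

  def main(args: Array[String]): Unit = {
    val arrays: Box[Array[String]] =
      Box(List("a|b|c", "d|e").map(_.split("\\|", -1)))
    val seqs: Box[Seq[String]] = Box(List(Seq("x"), Seq("y", "z")))

    // Both calls compile: T is inferred as Array[String] and Seq[String].
    println(fieldCounts(arrays).items)
    println(fieldCounts(seqs).items)
  }
}
```

The same signature shape should carry over to `RDD[T]`, with the conversion applied inside the function passed to `map`.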
For Array[A] to be viewed as Seq[A], there needs to be an implicit function in scope that can convert Arrays to Seqs. As Ionuț G. Stan said, it exists in scala.Predef.
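One way to check that such a conversion is in scope is to ask the compiler for it with `implicitly`. The sketch below again assumes Scala 2.12; `ArrayAsSeqSketch` and `view` are illustrative names. It compares the implicit view with the explicit `toSeq` call.

```scala
object ArrayAsSeqSketch {
  // Summon the implicit Array[String] => Seq[String] view; this fails to
  // compile if no such conversion is in scope.
  val view: Array[String] => Seq[String] =
    implicitly[Array[String] => Seq[String]]

  def main(args: Array[String]): Unit = {
    val fields: Array[String] = "a|b||".split("\\|", -1)

    val viaView: Seq[String] = view(fields)    // implicit conversion, applied by hand
    val viaToSeq: Seq[String] = fields.toSeq   // explicit conversion

    // Both are Seqs holding the same elements in the same order.
    println(viaView == viaToSeq)
  }
}
```

If you would rather not rely on implicit views, an explicit alternative in the Spark shell is to convert before calling the original function, along the lines of `foo(testRdd.map(_.toSeq))`, at the cost of an extra map step.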