The goal is to implement different type classes (like Semigroup, Monad, Functor, etc.) provided by Scalaz for Spark's RDD (distributed collection). Unfortunately, I cannot get any of the type classes that take higher-kinded types (like Monad, Functor, etc.) to work with RDDs.
RDDs are defined (simplified) as:
abstract class RDD[T: ClassTag]() {
  def map[U: ClassTag](f: T => U): RDD[U] = { ... }
}
Complete code for RDDs can be found here.
Here is one example that works fine:
import scalaz._, Scalaz._
import org.apache.spark.rdd.RDD
implicit def semigroupRDD[A] = new Semigroup[RDD[A]] {
  def append(x: RDD[A], y: => RDD[A]) = x.union(y)
}
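For illustration, a minimal usage sketch of this instance (assuming a SparkContext named sc is available and the implicit above is in scope; the values xs and ys are made up for the example):

import scalaz._, Scalaz._
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def combined(sc: SparkContext): RDD[Int] = {
  val xs: RDD[Int] = sc.parallelize(Seq(1, 2, 3))
  val ys: RDD[Int] = sc.parallelize(Seq(4, 5, 6))
  // |+| resolves to Semigroup#append, i.e. xs.union(ys) here
  xs |+| ys
}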
Here is one example that doesn't work:
implicit def functorRDD = new Functor[RDD] {
  override def map[A, B](fa: RDD[A])(f: A => B): RDD[B] = {
    fa.map(f)
  }
}
This fails with:
error: No ClassTag available for B
    fa.map(f)
The error is pretty clear: the map implemented on RDD expects a ClassTag for the result type (see above), while the Scalaz Functor/Monad signatures do not carry one. Is it even possible to make this work without modifying Scalaz and/or Spark?
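For reference, the relevant part of Scalaz's Functor looks roughly like this (paraphrased): map must work for completely unconstrained A and B, leaving no place to demand a ClassTag.

trait Functor[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
}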
Short answer: no.

For type classes like Functor, the restriction is that for any A and B, unconstrained, given a function A => B you get a lifted function RDD[A] => RDD[B]. In Spark you cannot pick arbitrary A and B, since you need a ClassTag for B, as you saw.

For other type classes like Semigroup, where the type doesn't change during the operation and therefore does not need a ClassTag, it works.
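If you are willing to step outside Scalaz, you can define your own ClassTag-constrained variant of Functor. The following is only a sketch, and the name CTFunctor is hypothetical rather than anything provided by Scalaz or Spark:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical type class: like Functor, but map also demands a ClassTag
// for the result type, which makes an RDD instance possible.
trait CTFunctor[F[_]] {
  def map[A, B: ClassTag](fa: F[A])(f: A => B): F[B]
}

implicit val rddCTFunctor: CTFunctor[RDD] = new CTFunctor[RDD] {
  def map[A, B: ClassTag](fa: RDD[A])(f: A => B): RDD[B] = fa.map(f)
}

Such an instance cannot be used where Scalaz expects its own Functor, but it captures the same mapping behaviour for RDDs.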