
Scalaz Type Classes for Apache Spark RDDs

The goal is to implement some of the type classes (Semigroup, Monad, Functor, etc.) provided by Scalaz for Spark's RDD (a distributed collection). Unfortunately, I cannot get any of the type classes that take higher-kinded types (like Monad, Functor, etc.) to work with RDDs.

RDDs are defined (simplified) as:

abstract class RDD[T: ClassTag]() {
   def map[U: ClassTag](f: T => U): RDD[U] = { ... }
}

Complete code for RDDs can be found here.

Here is one example that works fine:

import scalaz._, Scalaz._
import org.apache.spark.rdd.RDD

implicit def semigroupRDD[A]: Semigroup[RDD[A]] = new Semigroup[RDD[A]] {
   def append(x: RDD[A], y: => RDD[A]) = x.union(y)
}
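To sanity-check the pattern without a Spark cluster on hand, here is a minimal, self-contained sketch in which a hand-rolled `Semigroup` trait stands in for Scalaz's, and `Vector` with `++` stands in for `RDD` with `union`. The names `Semigroup`, `semigroupVector`, and `SemigroupOps` are illustrative stand-ins, not Spark or Scalaz API:

```scala
// Minimal stand-in for Scalaz's Semigroup, to show the pattern without a
// Spark or Scalaz dependency; Vector plays the role of RDD, ++ of union.
trait Semigroup[A] {
  def append(x: A, y: => A): A
}

// Mirrors the semigroupRDD instance above: combining means concatenating.
implicit def semigroupVector[A]: Semigroup[Vector[A]] = new Semigroup[Vector[A]] {
  def append(x: Vector[A], y: => Vector[A]) = x ++ y
}

// Mirrors Scalaz's |+| syntax for any type with a Semigroup in scope.
implicit class SemigroupOps[A](x: A)(implicit S: Semigroup[A]) {
  def |+|(y: => A): A = S.append(x, y)
}
```

With the real instance in scope, Scalaz's `|+|` works the same way on RDDs: `rdd1 |+| rdd2` is `rdd1.union(rdd2)`.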

Here is one example that doesn't work:

implicit def functorRDD: Functor[RDD] = new Functor[RDD] {
   override def map[A, B](fa: RDD[A])(f: A => B): RDD[B] = fa.map(f)
}

This fails with:

error: No ClassTag available for B
       fa.map(f)

The error is pretty clear: `map` as implemented on RDD expects a ClassTag (see above), while the Scalaz Functor/Monad signatures do not provide one. Is it even possible to make this work without modifying Scalaz and/or Spark?

asked Apr 17 '16 by marios

1 Answer

Short answer: no

For a type class like Functor, the requirement is that for any A and B, unconstrained, a function A => B can be lifted to RDD[A] => RDD[B]. In Spark you cannot pick arbitrary A and B, since you need a ClassTag for B, as you saw.
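The constraint can be reproduced without Spark at all. Below, a hypothetical single-machine `MiniRDD` (an illustrative stand-in, not Spark API) has a `map` that demands a ClassTag for the result type, just like `RDD.map`; a generic lift compiles only when the ClassTag bound is propagated, and dropping it reproduces the question's error:

```scala
import scala.reflect.ClassTag

// MiniRDD: a toy stand-in for RDD. Like Spark's map, its map requires a
// ClassTag for the result type (here, because it builds an Array[U]).
class MiniRDD[T: ClassTag](private val data: Array[T]) {
  def map[U: ClassTag](f: T => U): MiniRDD[U] = new MiniRDD(data.map(f))
  def collect: Array[T] = data
}

// This generic lift compiles only because the ClassTag bound is propagated:
def lift[A, B: ClassTag](f: A => B)(fa: MiniRDD[A]): MiniRDD[B] = fa.map(f)

// Dropping the bound reproduces the question's error:
// def liftUnconstrained[A, B](f: A => B)(fa: MiniRDD[A]) = fa.map(f)
//   → error: No ClassTag available for B
```

Scalaz's `Functor.map[A, B]` has the unconstrained shape of `liftUnconstrained`, which is exactly why no lawful instance can delegate to `RDD.map`.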

Other type classes like Semigroup work because the element type does not change during the operation, so no new ClassTag is needed.
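If you don't need Scalaz's actual Functor, a common workaround is to define your own ClassTag-aware variant. The `CTFunctor` trait below is hypothetical (it is not part of Scalaz or Spark): it has the same shape as Functor but threads the ClassTag through, which is exactly the hook RDD needs. The `Vector` instance is just a toy to keep the sketch runnable; for the real RDD the instance body would be `fa.map(f)`:

```scala
import scala.reflect.ClassTag

// CTFunctor: a hypothetical ClassTag-aware analogue of Functor. Unlike
// Scalaz's Functor, map carries a ClassTag bound on the result type, so an
// instance for RDD could legally call RDD.map.
trait CTFunctor[F[_]] {
  def map[A, B: ClassTag](fa: F[A])(f: A => B): F[B]
}

// Toy instance for Vector (which simply ignores the ClassTag); an RDD
// instance would have the same body: fa.map(f).
implicit val vectorCTFunctor: CTFunctor[Vector] = new CTFunctor[Vector] {
  def map[A, B: ClassTag](fa: Vector[A])(f: A => B): Vector[B] = fa.map(f)
}
```

The trade-off is that instances written this way cannot be passed to generic Scalaz code, since `CTFunctor` is not a subtype of Scalaz's `Functor`.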

answered Nov 06 '22 by adelbertc