Workaround for importing spark implicits everywhere

I'm new to Spark 2.0 and using datasets in our code base. I've noticed that I need to import spark.implicits._ everywhere in our code. For example:

File A
class A {
    def job(spark: SparkSession) = {
        import spark.implicits._
        //create dataset ds
        val b = new B(spark)
        b.doSomething(ds)
        doSomething(ds, spark)
    }
    private def doSomething(ds: Dataset[Foo], spark: SparkSession) = {
        import spark.implicits._
        ds.map(e => 1)            
    }
}

File B
class B(spark: SparkSession) {
    def doSomething(ds: Dataset[Foo]) = {
        import spark.implicits._
        ds.map(e => "SomeString")
    }
}

What I wanted to ask is if there's a cleaner way to be able to do

ds.map(e => "SomeString")

without importing implicits in every function where I do the map? If I don't import it, I get the following error:

Error:(53, 13) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
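
For instance, a bare map like the following (a sketch assuming the same Foo case class) is exactly what triggers it:

    def broken(ds: Dataset[Foo]) =
        ds.map(e => e.toString) // fails to compile: no Encoder[String] in scope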

asked Aug 16 '17 by user2103008


1 Answer

Something that would help a bit would be to do the import inside the class or object instead of in each function. For your "File A" and "File B" examples:

File A
class A {
    val spark = SparkSession.builder.getOrCreate()
    import spark.implicits._

    def job() = {
        //create dataset ds
        val b = new B(spark)
        b.doSomething(ds)
        doSomething(ds)
    }

    private def doSomething(ds: Dataset[Foo]) = {
        ds.map(e => 1)            
    }
}

File B
class B(spark: SparkSession) {
    import spark.implicits._

    def doSomething(ds: Dataset[Foo]) = {    
        ds.map(e => "SomeString")
    }
}

In this way, you get a manageable number of imports.

Unfortunately, to my knowledge there is no other way to reduce the number of imports even more. This is because the SparkSession object itself needs to be in scope when doing the actual import. Hence, this is the best that can be done.
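
To illustrate why (a minimal sketch using only the stock Spark API): the implicits are members of the session instance, not of the SparkSession class, so the import only compiles against a stable reference to an actual session value.

val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._ // compiles: `spark` is a stable identifier holding the instance

// import SparkSession.implicits._ // does not compile: `implicits` is not a
//                                 // member of the SparkSession companion object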


Update:

An even more convenient method is to create a Scala trait and combine it with an empty object. This allows for easy import of the implicits at the top of each file, while classes that need the SparkSession object itself can extend the trait.

Example:

trait SparkJob {
  val spark: SparkSession = SparkSession.builder
    .master(...)
    .config(..., ...) // Any settings to be applied
    .getOrCreate()
}

object SparkJob extends SparkJob {}
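
The empty object is what makes the file-level import possible: Scala imports require a stable path, and the singleton SparkJob.spark provides one, whereas a plain class instance could not be referenced in an import at the top of a file.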

With this we can do the following for File A and B:

File A:

import SparkJob.spark.implicits._
class A extends SparkJob {
  spark.sql(...) // Allows for usage of the SparkSession inside the class
  ...
}

File B:

import SparkJob.spark.implicits._
class B extends SparkJob {
  ...    
}

Note that it's only necessary to extend SparkJob for the classes or objects that use the spark object itself.
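
For example, a hypothetical File C that only needs the encoders can import the implicits without extending the trait (a sketch; the Foo case class is assumed as before):

import SparkJob.spark.implicits._

class C {
  def doSomething(ds: Dataset[Foo]) =
    ds.map(e => e.toString) // encoder for String comes from the imported implicits
}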

answered Oct 20 '22 by Shaido