
What is the best way to define custom methods on a DataFrame?

I need to define custom methods on DataFrame. What is the best way to do it? The solution should scale, as I intend to define a significant number of custom methods.

My current approach is to create a class (say MyClass) that takes a DataFrame as its constructor parameter, define my custom method (say customMethod) in it, and define an implicit method that converts a DataFrame to MyClass.

implicit def dataFrametoMyClass(df: DataFrame): MyClass = new MyClass(df)

Thus I can call:

dataFrame.customMethod()

Is this the correct way to do it? Open for suggestions.

Pravin Gadakh asked Sep 15 '15 12:09



2 Answers

Your way is the way to go (see [1]). I solved it a little differently, but the approach is similar:

Possibility 1

Implicits

import scala.language.implicitConversions
import org.apache.spark.sql.DataFrame

object ExtraDataFrameOperations {
  object implicits {
    // implicit conversion: wraps a DataFrame so that the extra methods can be called on it
    implicit def dFWithExtraOperations(df: DataFrame): DFWithExtraOperations = DFWithExtraOperations(df)
  }
}

case class DFWithExtraOperations(df: DataFrame) {
  def customMethod(param: String): DataFrame = {
    // do something fancy with the df
    // or delegate to some implementation
    //
    // here, just as an illustrating example: do a select
    df.select(df(param))
  }
}

Usage

To use the new customMethod method on a DataFrame:

import ExtraDataFrameOperations.implicits._
val df = ...
val otherDF = df.customMethod("hello")

Possibility 2

Instead of using an implicit method (see above), you can also use an implicit class:

Implicit class

import org.apache.spark.sql.DataFrame

object ExtraDataFrameOperations {
  implicit class DFWithExtraOperations(df: DataFrame) {
    def customMethod(param: String): DataFrame = {
      // do something fancy with the df
      // or delegate to some implementation
      //
      // here, just as an illustrating example: do a select
      df.select(df(param))
    }
  }
}

Usage

import ExtraDataFrameOperations._
val df = ...
val otherDF = df.customMethod("hello")

Remark

If you want to avoid the additional import, turn the object ExtraDataFrameOperations into a package object and store it in a file called package.scala within your package.
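As a rough sketch of that remark, a package.scala could look like the following. The package name com.example.analytics is hypothetical and chosen only for illustration. The implicit class and customMethod are the ones from Possibility 2.

```scala
// File: src/main/scala/com/example/analytics/package.scala
// NOTE: the package name com.example.analytics is an assumption made for this sketch.
package com.example

import org.apache.spark.sql.DataFrame

package object analytics {
  // Members of a package object are members of the package itself, so this
  // implicit class is in scope for code declared in com.example.analytics.
  implicit class DFWithExtraOperations(df: DataFrame) {
    def customMethod(param: String): DataFrame = df.select(df(param))
  }
}
```

Any file that starts with package com.example.analytics can then call df.customMethod("hello") without an import. Code in other packages still needs import com.example.analytics._.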

Official documentation / references

[1] The original blog "Pimp my library" by M. Odersky is available at http://www.artima.com/weblogs/viewpost.jsp?thread=179766

Martin Senne answered Nov 01 '22 06:11



There is a slightly simpler approach: just declare MyClass as implicit.

implicit class MyClass(df: DataFrame) { def myMethod = ... }

This automatically creates the implicit conversion method (also called MyClass). You can also make it a value class by adding extends AnyVal (the constructor parameter must then be declared as a val), which avoids some overhead by not actually creating a MyClass instance at runtime, but this is very unlikely to matter in practice.
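A minimal sketch of the value-class variant is shown below. The names ExtraDataFrameOperations, DFWithExtraOperations and customMethod are borrowed from the first answer purely for illustration. The select in the body is a placeholder, not a recommended implementation.

```scala
import org.apache.spark.sql.DataFrame

// A value class cannot be a local class or a member of another class,
// so it is defined inside an object here.
object ExtraDataFrameOperations {
  // `extends AnyVal` makes this a value class. It must have exactly one
  // constructor parameter, declared as a `val`.
  implicit class DFWithExtraOperations(val df: DataFrame) extends AnyVal {
    // Placeholder body: selects a single column by name.
    def customMethod(param: String): DataFrame = df.select(df(param))
  }
}
```

Call sites are the same as for the plain implicit class: import ExtraDataFrameOperations._ and then call df.customMethod("hello").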

Finally, putting MyClass into a package object will allow you to use the new methods anywhere in this package without requiring import of MyClass, which may be a benefit or a drawback for you.

Alexey Romanov answered Nov 01 '22 07:11
