While my first instinct is to use DataFrames for everything, it's just not possible -- some operations are clearly easier and/or better performing as RDD operations, not to mention certain APIs like GraphX only work on RDDs.
I seem to be spending a lot of time these days converting back and forth between DataFrames and RDDs -- so what's the performance hit? Take RDD.checkpoint -- there's no DataFrame equivalent, so what happens under the hood when I do:
val df = Seq((1,2),(3,4)).toDF("key","value")
val rdd = df.rdd.map(...)
val newDf = rdd.map(r => (r.getInt(0), r.getInt(1))).toDF("key","value")
Obviously, this is a trivially small example, but it would be great to know what happens behind the scenes in the conversion.
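For reference, here's a self-contained version of that round trip (assuming a SparkSession in scope as spark, with the elided map filled in by a purely hypothetical transformation -- the actual logic doesn't matter for the question):

import spark.implicits._

// same trivial DataFrame as above
val df = Seq((1, 2), (3, 4)).toDF("key", "value")

// hypothetical RDD-only work, e.g. something that needs RDD.checkpoint
val rdd = df.rdd.map(r => (r.getInt(0), r.getInt(1) * 2))
rdd.checkpoint() // requires spark.sparkContext.setCheckpointDir(...) to have been called first

// and back to a DataFrame
val newDf = rdd.toDF("key", "value")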
Let's look at df.rdd first. This is defined as:
lazy val rdd: RDD[Row] = {
  // use a local variable to make sure the map closure doesn't capture the whole DataFrame
  val schema = this.schema
  queryExecution.toRdd.mapPartitions { rows =>
    val converter = CatalystTypeConverters.createToScalaConverter(schema)
    rows.map(converter(_).asInstanceOf[Row])
  }
}
So firstly, it runs queryExecution.toRdd, which basically prepares the execution plan based on the operators used to build up the DataFrame, and computes an RDD[InternalRow] that represents the outcome of the plan.
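As a quick sketch of what that looks like from the outside (assuming a SparkSession named spark; the concrete row classes can vary between Spark versions), you can peek at both RDDs via the developer API:

import spark.implicits._

val df = Seq((1, 2), (3, 4)).toDF("key", "value")

// the RDD[InternalRow] produced by the physical plan (developer API)
val internal = df.queryExecution.toRdd

// the public RDD[Row], i.e. the result of the converter shown below
val external = df.rdd

println(internal.first().getClass) // typically an UnsafeRow (internal representation)
println(external.first().getClass) // GenericRowWithSchema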
Next, these InternalRows (which are only for internal use) of that RDD will be mapped to normal Rows. This entails the following for each row:
override def toScala(row: InternalRow): Row = {
  if (row == null) {
    null
  } else {
    val ar = new Array[Any](row.numFields)
    var idx = 0
    while (idx < row.numFields) {
      ar(idx) = converters(idx).toScala(row, idx)
      idx += 1
    }
    new GenericRowWithSchema(ar, structType)
  }
}
So it loops over all fields, converts them to 'Scala' space (from Catalyst space), and creates the final row with them. toDF will pretty much do these things in reverse.
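To make that reverse direction concrete, here's a minimal sketch (again assuming a SparkSession named spark) of the two usual ways back to a DataFrame; either way, every field has to be converted back into Catalyst's internal representation:

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import spark.implicits._

val df = Seq((1, 2), (3, 4)).toDF("key", "value")

// via toDF on an RDD of tuples or case classes (needs the implicits above)
val viaTuples = df.rdd
  .map(r => (r.getInt(0), r.getInt(1)))
  .toDF("key", "value")

// or explicitly, keeping an RDD[Row] and re-attaching a schema
val schema = StructType(Seq(
  StructField("key", IntegerType, nullable = false),
  StructField("value", IntegerType, nullable = false)))
val viaRows = spark.createDataFrame(df.rdd, schema)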
This all will indeed have some impact on your performance. How much depends on how complex these operations are compared to the things you do with the data. The bigger potential impact, however, is that Spark's Catalyst optimizer can only optimize the operations between the conversions to and from RDDs, rather than optimizing the full execution plan as a whole. It would be interesting to see which operations you have trouble with; I find most things can be done using basic expressions or UDFs (see the sketch below). Using modules that only work on RDDs is a very valid use case, though!
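As a rough sketch of that alternative (the column names here are just illustrative), the same per-row work can often stay in the DataFrame API, either as a built-in expression or as a UDF, so the whole plan remains visible to Catalyst:

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

val df = Seq((1, 2), (3, 4)).toDF("key", "value")

// built-in expression: fully optimizable by Catalyst
val viaExpr = df.withColumn("sum", col("key") + col("value"))

// UDF: the function body is a black box to Catalyst, but the rest of the plan is still optimized
val addUdf = udf((k: Int, v: Int) => k + v)
val viaUdf = df.withColumn("sum", addUdf(col("key"), col("value")))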