I am very new to Scala. I have experience in Java and R.
I am confused about the immutability of DataFrames and memory management. The reason is this:
A data frame in R is also immutable, and this eventually proved unworkable. Simplistically put, when working with a very large number of columns, each transformation led to a new data frame: 1000 consecutive operations on 1000 columns would produce 1000 data frame objects. Now, most data scientists prefer R's data.table, which performs operations by reference on a single data.table object.
Spark's DataFrame (to a newbie) seems to have a similar problem. The following code, for example, appears to create 1000 DataFrames when casting 1000 columns. Despite the foldLeft(), each call to withColumn() creates a new DataFrame instance.
So, do I trust that garbage collection on the JVM is efficient enough, or do I need to try to limit the number of immutable instances created? If the latter, what techniques should I be looking at?
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DataType

def castAllTypedColumnsTo(df: DataFrame,
                          sourceType: DataType, targetType: DataType): DataFrame = {
  // Find every column whose current type matches sourceType
  val columnsToBeCasted = df.schema.filter(s => s.dataType == sourceType)

  if (columnsToBeCasted.nonEmpty) {
    println(s"Found ${columnsToBeCasted.length} columns " +
      s"(${columnsToBeCasted.map(s => s.name).mkString(",")})" +
      s" - casting to ${targetType.typeName.capitalize}Type")
  }

  // Each step of the fold calls withColumn, which returns a new DataFrame
  columnsToBeCasted.foldLeft(df) { (foldedDf, col) =>
    castColumnTo(foldedDf, col.name, targetType)
  }
}
This helper method returns a new DataFrame instance on each call:
private def castColumnTo(df: DataFrame, cn: String, tpe: DataType): DataFrame = {
  // Replace column `cn` with the same column cast to `tpe`,
  // returning a new DataFrame that references the old one
  df.withColumn(cn, df(cn).cast(tpe))
}
The difference is laziness. Each new DataFrame that is returned is not materialized in memory; it just stores a reference to the base DataFrame and the transformation to apply to it. It is essentially an execution plan describing how to create some data, not the data itself.
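As a minimal sketch of what that means in practice (assuming the castAllTypedColumnsTo method from the question is in scope; the session setup and tiny DataFrame are purely illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType}

val spark = SparkSession.builder()
  .appName("lazy-cast-demo") // illustrative name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Two string columns, so castAllTypedColumnsTo(df, StringType, IntegerType)
// will touch both of them
val df = Seq(("1", "2"), ("3", "4")).toDF("a", "b")

// Each castColumnTo/withColumn call only adds a step to the logical plan;
// no rows are copied or materialized here
val casted = castAllTypedColumnsTo(df, StringType, IntegerType)

casted.explain() // prints the query plan -- still nothing executed
casted.count()   // an action: only now are the casts actually run

In the explain() output the consecutive projections are typically collapsed into a single one by the optimizer, so the chain of intermediate DataFrames never exists as 1000 separate copies of the data.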
When it comes time to actually execute and save the result somewhere, all 1000 operations can be applied to each row in parallel, so you get one additional output DataFrame. Spark condenses as many operations together as possible and does not materialize anything that is unnecessary or that has not been explicitly requested to be saved or cached.
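If you still want to limit how many intermediate DataFrames (and plan nodes) are created, for example for very wide tables, one common pattern is to express every cast in a single select. The helper below is a hypothetical sketch, not part of the original question or answer:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DataType

// Hypothetical alternative: build all cast expressions up front and apply them
// in one select, so the whole conversion is a single projection
def castAllTypedColumnsInOneSelect(df: DataFrame,
                                   sourceType: DataType,
                                   targetType: DataType): DataFrame = {
  val cols = df.schema.map { field =>
    if (field.dataType == sourceType) col(field.name).cast(targetType).as(field.name)
    else col(field.name)
  }
  df.select(cols: _*)
}

Either way, the key point stands: until an action such as count() or a write is triggered, none of these DataFrames hold any data.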