 

Clone/Deep-Copy a Spark DataFrame

How can a deep copy of a DataFrame be requested, without resorting to a full re-computation of the original DataFrame's contents?

The purpose is to perform a self-join on a Spark Stream.

asked Jul 15 '19 by WestCoastProjects

People also ask

How do I copy a PySpark DataFrame to another?

schema. copy" new schema instance created without old schema modification; In each Dataframe operation, which return Dataframe ("select","where", etc), new Dataframe is created, without modification of original. Original can be used again and again.

How do I copy one DataFrame to another?

To copy a Pandas DataFrame, use the copy() method. The DataFrame.copy() method makes a copy of the provided object's indices and data. It accepts one parameter called deep, and it returns the Series or DataFrame that matches the caller.


1 Answer

DataFrames are immutable. That means you don't have to make deep copies: you can reuse a DataFrame multiple times, every operation creates a new DataFrame, and the original stays unmodified.

For example:

import spark.implicits._ // for toDF and the $ column syntax (spark is the SparkSession, e.g. in spark-shell)

val df = List(1, 2, 3).toDF("id") // original dataframe

val df1 = df.as("df1") // second dataframe (an alias over df)
val df2 = df.as("df2") // third dataframe (another alias over df)

df1.join(df2, $"df1.id" === $"df2.id") // fourth dataframe, and df is still unmodified

It might seem like a waste of resources, but since all data in a DataFrame is also immutable, all four DataFrames can reuse references to the same underlying objects.
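
As for the original motivation of a self-join on a Spark Stream: the same aliasing pattern carries over to Structured Streaming. The sketch below is an assumption-laden illustration, not part of the original answer; it uses the built-in rate source just to have a streaming DataFrame, and it assumes a Spark version recent enough to support stream-stream self-joins:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("streaming-self-join-sketch").getOrCreate()
import spark.implicits._

// Streaming source with columns timestamp and value, used purely for illustration
val stream = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

// Same aliasing trick as above: two views of one streaming DataFrame, no deep copy needed
val joined = stream.as("s1").join(stream.as("s2"), $"s1.value" === $"s2.value")

// Inner stream-stream joins accumulate state; real code should add watermarks to bound it
joined.writeStream.format("console").start().awaitTermination()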

answered Nov 22 '22 by Krzysztof Atłasik