How can a deep-copy of a DataFrame be requested - without resorting to a full re-computation of the original DataFrame contents?
The purpose is to perform a self-join on a Spark Stream.
schema.copy creates a new schema instance without modifying the old one. Likewise, every DataFrame operation that returns a DataFrame ("select", "where", etc.) creates a new DataFrame without modifying the original, so the original can be reused again and again.
To copy a Pandas DataFrame, use the copy() method. DataFrame.copy() makes a copy of the calling object's indices and data. It accepts one parameter, deep, and returns a Series or DataFrame of the same type as the caller.
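A minimal sketch of the pandas behaviour described above (the column name "id" is just an example); with deep=True, modifying the copy leaves the original untouched:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})

# deep=True (the default): indices and data are copied
deep = df.copy(deep=True)
deep.loc[0, "id"] = 99  # mutate the copy only

print(df.loc[0, "id"])    # 1 -- original is unchanged
print(deep.loc[0, "id"])  # 99
```

Note that with deep=False the copy shares the underlying data with the original, and whether a modification propagates back depends on the pandas version and its copy-on-write setting, so a deep copy is the safe choice when isolation matters.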
DataFrames are immutable. That means you don't need deep copies: you can reuse a DataFrame multiple times, every operation creates a new DataFrame, and the original stays unmodified.
For example:
import spark.implicits._ // assumes a SparkSession named spark
val df = List(1, 2, 3).toDF("id")
val df1 = df.as("df1") // second DataFrame
val df2 = df.as("df2") // third DataFrame
df1.join(df2, $"df1.id" === $"df2.id") // fourth DataFrame; df is still unmodified
This may seem like a waste of resources, but since all data in a DataFrame is also immutable, all four DataFrames can share references to the objects inside them.