 

How to overwrite entire existing column in Spark dataframe with new column?

I want to overwrite a Spark column with a new column that is a binary flag.

I tried directly overwriting the column id2, but why doesn't it work like an in-place operation in Pandas?

How can I do it without using withColumn() to create a new column and drop() to drop the old one?

I know that a Spark DataFrame is immutable. Is that the reason, or is there a different way to overwrite a column without using withColumn() and drop()?

    df2 = spark.createDataFrame(
        [(1, 1, float('nan')), (1, 2, float(5)), (1, 3, float('nan')), (1, 4, float('nan')), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
        ('session', "timestamp1", "id2"))

    df2.select(df2.id2 > 0).show()

+---------+
|(id2 > 0)|
+---------+
|     true|
|     true|
|     true|
|     true|
|     true|
|     true|
|     true|
+---------+
    # Attempting to overwrite df2.id2
    df2.id2=df2.select(df2.id2 > 0).withColumnRenamed('(id2 > 0)','id2')
    df2.show()
# Overwriting unsuccessful
+-------+----------+----+
|session|timestamp1| id2|
+-------+----------+----+
|      1|         1| NaN|
|      1|         2| 5.0|
|      1|         3| NaN|
|      1|         4| NaN|
|      1|         5|10.0|
|      1|         6| NaN|
|      1|         6| NaN|
+-------+----------+----+
asked Jun 19 '17 by GeorgeOfTheRF


People also ask

How do I update a column in spark DataFrame?

You can update a PySpark DataFrame column using withColumn(), select(), and sql(). Since DataFrames are distributed immutable collections, you can't really change column values in place; when you change a value using withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.
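
For example, a minimal sketch (assuming a SparkSession named spark, as in the question; the DataFrame and column names are just for illustration):

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 5.0), (2, float('nan'))], ("session", "id2"))

# Passing an existing column name to withColumn() replaces that column
# and returns a new DataFrame; the original df is left untouched.
df = df.withColumn("id2", F.col("id2") > 0)
df.show()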

How do you replace a column in a DataFrame in PySpark?

You can replace column values of PySpark DataFrame by using SQL string functions regexp_replace(), translate(), and overlay() with Python examples.
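
A small sketch of two of those functions (the column name and patterns are made up for illustration, again assuming a spark session):

from pyspark.sql import functions as F

df = spark.createDataFrame([("abc-123",), ("xyz-456",)], ("code",))

# regexp_replace: replace every substring matching a regex pattern
df = df.withColumn("code", F.regexp_replace("code", "-", "_"))

# translate: replace characters one-for-one
df = df.withColumn("code", F.translate("code", "_", ":"))
df.show()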

What is the use of withColumn in spark?

Returns a new DataFrame by adding a column or replacing the existing column that has the same name. The column expression must be an expression over this DataFrame; attempting to add a column from some other DataFrame will raise an error.
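
A short sketch of both behaviours (df and the column names are assumptions for illustration):

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 5.0)], ("session", "id2"))

# "flag" does not exist yet, so a new column is added
df_added = df.withColumn("flag", F.col("id2") > 0)

# "id2" already exists, so the existing column is replaced
df_replaced = df.withColumn("id2", F.col("id2") > 0)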

How do you use the Replace function in PySpark DataFrame?

The function withColumn is called to add (or replace, if the name exists) a column to the data frame. The function regexp_replace will generate a new column by replacing all substrings that match the pattern.


1 Answer

You can use

d1.withColumnRenamed("colName", "newColName")
d1.withColumn("newColName", $"colName")

withColumnRenamed renames the existing column to a new name.

withColumn creates a new column with the given name. If a column with that name already exists, it replaces it and drops the old one.

In your case the changes are not applied to the original DataFrame df2; the expression produces a new DataFrame with the renamed column, and that result has to be assigned to a new variable for further use.

d3 = df2.select((df2.id2 > 0).alias("id2"))

The above should work fine in your case.
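
If you also need to keep the other columns, the same select() approach can be extended (a sketch reusing the column names from the question, reassigning the result since the original df2 is immutable):

df2 = df2.select("session", "timestamp1", (df2.id2 > 0).alias("id2"))
df2.show()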

Hope this helps!

answered Sep 24 '22 by koiralo