How to overwrite an entire existing column in a Spark DataFrame with a new column?

Tags:

dataframe

apache-spark

apache-spark-sql

pyspark

apache-spark-mllib

I want to overwrite a Spark column with a new column that is a binary flag.

I tried overwriting the column id2 directly, but why does it not work like an in-place operation in Pandas?

How can I do it without using withColumn() to create a new column and drop() to drop the old one?

I know that a Spark DataFrame is immutable. Is that the reason, or is there a different way to overwrite the column without using withColumn() and drop()?

    df2 = spark.createDataFrame(
        [(1, 1, float('nan')), (1, 2, float(5)), (1, 3, float('nan')), (1, 4, float('nan')), (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
        ('session', "timestamp1", "id2"))

    df2.select(df2.id2 > 0).show()

    +---------+
    |(id2 > 0)|
    +---------+
    |     true|
    |     true|
    |     true|
    |     true|
    |     true|
    |     true|
    |     true|
    +---------+

    # Attempting to overwrite df2.id2
    df2.id2 = df2.select(df2.id2 > 0).withColumnRenamed('(id2 > 0)', 'id2')
    df2.show()

    # Overwriting unsuccessful
    +-------+----------+----+
    |session|timestamp1| id2|
    +-------+----------+----+
    |      1|         1| NaN|
    |      1|         2| 5.0|
    |      1|         3| NaN|
    |      1|         4| NaN|
    |      1|         5|10.0|
    |      1|         6| NaN|
    |      1|         6| NaN|
    +-------+----------+----+

asked Jun 19 '17 06:06

GeorgeOfTheRF

1 Answer

You can use

    d1.withColumnRenamed("colName", "newColName")
    d1.withColumn("newColName", d1["colName"])

withColumnRenamed renames an existing column.

withColumn adds a column with the given name. If a column with that name already exists, it is replaced, so no separate drop() is needed.

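In your case the change is not applied to the original DataFrame df2. Because DataFrames are immutable, each transformation returns a new DataFrame, which you need to assign to a variable (a new one, or df2 itself) for further use. Assigning to df2.id2 does not modify the DataFrame's data.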

    d3 = df2.select((df2.id2 > 0).alias("id2"))

The above should work in your case. Note that it returns only the id2 column.

Hope this helps!

answered Sep 24 '22 13:09

koiralo
