In SQL, I can easily update a column value using UPDATE. For example, I have a table (student) like:
student_id  grade  new_student_id
123         B      234
555         A      null
UPDATE Student
SET student_id = new_student_id
WHERE new_student_id IS NOT NULL
How can I do the same in Spark using Spark SQL (PySpark)?
The withColumn() function of the DataFrame is used to update the value of a column. withColumn() takes two arguments: the name of the column you want to update, and the expression producing the new value.
Spark SQL now also supports UPDATE, DELETE, and similar data-modification operations, provided the underlying table is stored in Delta Lake format.
In PySpark, to add a new column with a constant value to a DataFrame, use the lit() function (from pyspark.sql.functions import lit). lit() takes a constant value and returns a Column; to add a NULL / None value, use lit(None).
To replace values inside an existing string column, regexp_replace is one of the easiest methods: for example, it can replace every occurrence of "a" with a zero. For plain character-for-character substitution where no regular expression is needed, the translate function is recommended. Neither approach renames the column or changes its data type.
While working with Spark DataFrames, many common operations may return null values in some of the records. Later operations can then fail when they encounter those null/empty values, so we often have to replace them before continuing to process the DataFrame.
You can use withColumn to overwrite the existing new_student_id column, keeping its original value when it is not null and otherwise substituting the value from the student_id column:
from pyspark.sql.functions import col, when

# Create sample data
students = spark.createDataFrame(
    [(123, 'B', 234), (555, 'A', None)],
    ['student_id', 'grade', 'new_student_id'])

# Use student_id when new_student_id is not populated
cleaned = students.withColumn(
    "new_student_id",
    when(col("new_student_id").isNull(), col("student_id"))
    .otherwise(col("new_student_id")))

cleaned.show()
Using your sample data as input:
+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
| 123| B| 234|
| 555| A| null|
+----------+-----+--------------+
the output data looks as follows:
+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
| 123| B| 234|
| 555| A| 555|
+----------+-----+--------------+