How to change column value in Spark SQL

In SQL, I can easily update a column value using UPDATE. For example, I have a table (student) like:

student_id  grade  new_student_id
123         B      234
555         A      null

UPDATE Student
SET student_id = new_student_id
WHERE new_student_id IS NOT NULL

How can I do this in Spark using Spark SQL (PySpark)?

asked Mar 12 '17 by rainyballball

People also ask

How do I change the value of column in spark?

Spark's withColumn() function on a DataFrame is used to update the value of a column. withColumn() takes two arguments: the name of the column you want to update, and the value (a Column expression) you want to update it with.
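A minimal sketch of withColumn(), assuming an existing SparkSession and a hypothetical DataFrame with a salary column (not from the original question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical example data
df = spark.createDataFrame([("alice", 100), ("bob", 200)], ["name", "salary"])

# Overwrite the existing "salary" column with a new expression built from it
df = df.withColumn("salary", col("salary") * 1.1)
df.show()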

Can we use update in spark SQL?

Spark SQL now supports UPDATE, DELETE and similar data-modification operations if the underlying table is stored in Delta format.
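A minimal sketch, assuming the delta-spark package is installed, the session is configured with the Delta Lake SQL extensions, and a Delta table named student with the columns from the question already exists:

from pyspark.sql import SparkSession

# Assumes spark.sql.extensions and spark.sql.catalog.spark_catalog are configured for Delta Lake
spark = SparkSession.builder.getOrCreate()

# UPDATE works here because the underlying table is stored in Delta format
spark.sql("""
    UPDATE student
    SET student_id = new_student_id
    WHERE new_student_id IS NOT NULL
""")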

How do I assign a value to a column in a DataFrame in PySpark?

In PySpark, to add a column with a constant value to a DataFrame, use the lit() function (from pyspark.sql.functions import lit). lit() takes a constant value and returns a Column type; if you want to add a NULL / None value, use lit(None).
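A minimal sketch with hypothetical data (the name and country values are illustrative, not from the original question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Add a constant column and a NULL column
df = df.withColumn("country", lit("US"))
df = df.withColumn("middle_name", lit(None).cast("string"))
df.show()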

How do I replace a column in a spark dataframe?

You can replace a Spark DataFrame column value using regexp_replace. This is one of the easiest methods you can use to replace a DataFrame column value; for example, the sketch below replaces every “a” with zero.
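A minimal sketch, using a hypothetical code column (not from the original question), that replaces every “a” with “0”:

from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("abc",), ("banana",)], ["code"])

# Replace every occurrence of the pattern "a" with "0"
df = df.withColumn("code", regexp_replace("code", "a", "0"))
df.show()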

How to replace Dataframe column value in pyspark?

This is about replacing values, not renaming columns or converting their data types. Several methods can be used to replace a DataFrame column value in PySpark, including the translate function (recommended for character-by-character replacement) and regexp_replace, shown above. The sketch below checks the translate method with a small test DataFrame.
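A minimal sketch of translate(), using a hypothetical code column (not from the original question); translate substitutes characters one-for-one:

from pyspark.sql import SparkSession
from pyspark.sql.functions import translate

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a1-b2",), ("c3-d4",)], ["code"])

# Replace "-" with "/" and the digits 1-4 with w-z, character by character
df = df.withColumn("code", translate("code", "-1234", "/wxyz"))
df.show()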

How to convert spark dataframe column age back to an integer?

Let's use selectExpr() to convert the Spark DataFrame column age back to an integer, isGraduated from boolean to string, and jobStartDate from date to string. You can also use a SQL expression to change a Spark DataFrame column type.
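A minimal sketch, assuming hypothetical columns named age, isGraduated and jobStartDate as in the description above:

import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("25", True, datetime.date(2020, 1, 15))],
    ["age", "isGraduated", "jobStartDate"],
)

# Cast each column using SQL expressions inside selectExpr()
converted = df.selectExpr(
    "cast(age as int) as age",
    "cast(isGraduated as string) as isGraduated",
    "cast(jobStartDate as string) as jobStartDate",
)
converted.printSchema()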

Why is my spark dataframe returning null values?

While working with Spark DataFrames, many of the operations we typically perform may return null values in some records. From that point onwards, other operations may fail when null/empty values are encountered, so we have to replace these values in order to keep processing the DataFrame.
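A minimal sketch, using hypothetical columns, that replaces nulls with per-column defaults via DataFrame.fillna():

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice", None), (None, 30)], "name string, age int")

# Replace nulls in each column with a default value
cleaned = df.fillna({"name": "unknown", "age": 0})
cleaned.show()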


1 Answer

You can use withColumn to overwrite the existing new_student_id column, keeping the original new_student_id value where it is not null and otherwise using the value from the student_id column:

from pyspark.sql.functions import col, when

# Create sample data
students = sc.parallelize(
    [(123, 'B', 234), (555, 'A', None)]
).toDF(['student_id', 'grade', 'new_student_id'])

# Use withColumn to fall back to student_id when new_student_id is not populated
cleaned = students.withColumn(
    "new_student_id",
    when(col("new_student_id").isNull(), col("student_id"))
    .otherwise(col("new_student_id"))
)
cleaned.show()

Using your sample data as input:

+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
|       123|    B|           234|
|       555|    A|          null|
+----------+-----+--------------+

the output data looks as follows:

+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
|       123|    B|           234|
|       555|    A|           555|
+----------+-----+--------------+
answered Sep 17 '22 by Alex