
How to modify a column value in a row of a spark dataframe?

I am working with a data frame with the following structure (shown as a screenshot in the original post): it has columns such as col4 and col5, their post_ counterparts post_col4 and post_col5, and a post_event_list column.

I need to modify each record so that, if a column is listed in post_event_list, that column is populated with the corresponding post_ column value. So in the above example, for both records I need to populate col4 and col5 with the post_col4 and post_col5 values. Can someone please help me do this in PySpark?

asked Sep 09 '16 by ab_

People also ask

How do I change the value of a column in PySpark?

You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Since DataFrames are distributed, immutable collections, you can't really change column values in place; when you "change" a value with withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.
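For example, a minimal sketch of withColumn() (the DataFrame and column names here are made up just to illustrate the behaviour):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice", 100), ("bob", 200)], ["name", "salary"])

# withColumn() does not touch df; it returns a new DataFrame with the updated column
updated = df.withColumn("salary", F.col("salary") * 1.1)
updated.show()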

Can you edit the contents of an existing spark DataFrame?

Just like in SQL, you can join two DataFrames and perform various actions and transformations on Spark DataFrames. As mentioned earlier, Spark DataFrames are immutable: you cannot change an existing DataFrame; instead, you create a new DataFrame with the updated values.
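A small sketch of that idea (the two DataFrames here are invented for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
employees = spark.createDataFrame([(1, "alice"), (2, "bob")], ["dept_id", "name"])
departments = spark.createDataFrame([(1, "sales"), (2, "engineering")], ["dept_id", "dept_name"])

# join() never mutates its inputs; it produces a brand-new DataFrame
joined = employees.join(departments, on="dept_id", how="inner")
joined.show()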

How do you change a value in PySpark?

The regexp_replace(), translate(), and overlay() functions can be used to replace values in PySpark DataFrames.
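For example, a quick sketch of regexp_replace() and translate() on a made-up address column:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1 Main Rd",), ("5 High Rd",)], ["address"])

# regexp_replace(): replace substrings that match a regular expression
df = df.withColumn("address", F.regexp_replace("address", "Rd$", "Road"))

# translate(): character-by-character substitution ('1' -> 'A', '5' -> 'B')
df = df.withColumn("address", F.translate("address", "15", "AB"))

df.show()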

How do I convert columns to rows in spark?

Spark's pivot() function is used to pivot/rotate data from one DataFrame/Dataset column into multiple columns (transforming rows into columns), and unpivoting is used to transform it back (columns into rows).
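A short sketch of both directions (the sales data is made up; stack() in a selectExpr is one common way to unpivot):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sales = spark.createDataFrame(
    [("2016", "Q1", 100), ("2016", "Q2", 200), ("2017", "Q1", 150)],
    ["year", "quarter", "amount"])

# pivot(): distinct 'quarter' values become columns (rows -> columns)
wide = sales.groupBy("year").pivot("quarter").sum("amount")
wide.show()

# unpivot back to long form with stack() (columns -> rows)
long_df = wide.selectExpr("year", "stack(2, 'Q1', Q1, 'Q2', Q2) as (quarter, amount)")
long_df.show()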


2 Answers

Maybe this is what you want (PySpark 2).

Suppose df is the DataFrame:

from pyspark.sql import Row

# take one row and convert it to a dict so it can be modified
row = df.rdd.first()
d = row.asDict()
# copy the post_ value into the target column
d['col4'] = d['post_col4']
# build a new Row from the modified dict (Row objects are immutable)
new_row = Row(**d)

Now we have a new Row object.

Wrapping this logic in a map over df.rdd lets you apply the change to the whole DataFrame, as in the sketch below.
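A minimal sketch of that map-based approach, using the column names from the question (adjust fix_row to your real schema):

from pyspark.sql import Row

def fix_row(row):
    d = row.asDict()
    for c in ("col4", "col5"):
        # if the column name appears in post_event_list, take the post_ value
        if c in (d.get("post_event_list") or ""):
            d[c] = d["post_" + c]
    return Row(**d)

# the schema is re-inferred from the Row fields; re-select df.columns to keep the original column order
new_df = df.rdd.map(fix_row).toDF().select(df.columns)
new_df.show()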

answered Sep 30 '22 by Zee Cheung


You can use when/otherwise from pyspark.sql.functions. Something like:

import pyspark.sql.functions as sf
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

# true when 'col4' appears in the post_event_list value for that row
contains_col4_udf = udf(lambda x: 'col4' in x, BooleanType())

df.select(sf.when(contains_col4_udf('post_event_list'), sf.col('post_col4'))
            .otherwise(sf.col('col4'))
            .alias('col4'))

Here is the doc: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.otherwise
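As a follow-up, here is a sketch of applying the same when/otherwise pattern with withColumn() for both col4 and col5, so the rest of the DataFrame is preserved (column names taken from the question):

import pyspark.sql.functions as sf
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

for c in ("col4", "col5"):
    # c=c freezes the loop variable inside the lambda
    contains_udf = udf(lambda x, c=c: c in (x or ""), BooleanType())
    df = df.withColumn(c,
        sf.when(contains_udf("post_event_list"), sf.col("post_" + c))
          .otherwise(sf.col(c)))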

answered Sep 30 '22 by phi