PySpark: add a new field to a data frame Row element

I have the following element:

a = Row(ts=1465326926253, myid=u'1234567', mytype=u'good') 

a is an instance of the Spark DataFrame Row class. I want to append a new field to a, so that it becomes:

a = Row(ts=1465326926253, myid=u'1234567', mytype=u'good', name = u'john') 
asked Oct 01 '16 by Edamame

People also ask

How do I add a new column to a DataFrame in PySpark?

In PySpark, to add a new column to a DataFrame, use the lit() function: from pyspark.sql.functions import lit. lit() takes a constant value and returns a Column type; to add a NULL/None column, use lit(None).

How do I update my Spark DataFrame?

You can update a PySpark DataFrame column using withColumn(), select(), or sql(). Since DataFrames are distributed, immutable collections, you can't really change column values in place; when you change a value using withColumn() or any other approach, PySpark returns a new DataFrame with the updated values.


1 Answer

Here is an updated answer that works. First convert the Row to a dictionary, update the dict, then build a new pyspark Row from it.

Code is as follows:

from pyspark.sql import Row

# Create the pyspark Row
row = Row(field1=12345, field2=0.0123, field3=u'Last Field')

# Convert to a python dict
temp = row.asDict()

# Modify the dict however you like, e.g. add a new field
temp["field4"] = "it worked!"

# Build a new Row from the updated dict (Row objects are immutable)
output = Row(**temp)

# How it looks
output

Out[1]: Row(field1=12345, field2=0.0123, field3=u'Last Field', field4='it worked!')
answered Oct 10 '22 by Ish Mitch