Modify spark DataFrame column

I would like to change the following dataframe:

--id--rating--timestamp--
-------------------------
| 0 | 5.0  |  231312231 |
| 1 | 3.0  |  192312311 | # Epoch time (seconds since Thursday, 1 January 1970)
-------------------------

to the following dataframe:

--id--rating--timestamp--
--------------------------
| 0 |  5.0  |  05        |
| 1 |  3.0  |  04        | # Month of year
--------------------------

How can I do that?

Lechucico asked May 18 '17 14:05



2 Answers

This is easy with the built-in functions:

import org.apache.spark.sql.functions._
import spark.implicits._
val newDF = dataset.withColumn("timestamp", month(from_unixtime('timestamp)))

Note that DataFrames are immutable, so you can create a new DataFrame but not modify an existing one. Of course, you can assign the new Dataset to the same variable if it is declared with var.

Note 2: DataFrame is an alias for Dataset[Row], which is why I use both names.
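
For anyone who wants to try the approach above end to end, here is a minimal sketch. It assumes Spark 2.2 or later and a throwaway local session; the object name, method names, app name, and the UTC session time zone are illustrative choices, not part of the original answer. Since the question shows a zero-padded "05", the sketch also includes a variant that uses the two-argument from_unixtime with the "MM" pattern, which yields a string instead of an integer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_unixtime, month}

object MonthFromEpoch {
  // Replaces the epoch-seconds "timestamp" column with the month as an Int (1-12).
  def withMonth(df: DataFrame): DataFrame =
    df.withColumn("timestamp", month(from_unixtime(col("timestamp"))))

  // Variant: replaces "timestamp" with a zero-padded month string such as "05".
  def withPaddedMonth(df: DataFrame): DataFrame =
    df.withColumn("timestamp", from_unixtime(col("timestamp"), "MM"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption); the time zone is pinned
    // to UTC because the month can differ near month boundaries in other zones.
    val spark = SparkSession.builder()
      .appName("month-from-epoch")
      .master("local[1]")
      .config("spark.sql.session.timeZone", "UTC")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val dataset = Seq((0, 5.0, 231312231L), (1, 3.0, 192312311L))
      .toDF("id", "rating", "timestamp")

    withMonth(dataset).show()        // integer month column
    withPaddedMonth(dataset).show()  // zero-padded string month column

    spark.stop()
  }
}
```

Both helpers return a new DataFrame and leave the input untouched, in line with the immutability note above.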

T. Gawęda answered Oct 26 '22 23:10


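If you are coming from Scala, you can use the methods in org.apache.spark.sql.functions inside DataFrame.select or DataFrame.withColumn. For your case, I think month(e: Column): Column can perform the change you want. Note that month takes a Column rather than a column name, and the epoch seconds need to be converted with from_unixtime first. It would look something like this:

import org.apache.spark.sql.functions.{col, from_unixtime, month}
df.withColumn("timestamp", month(from_unixtime(col("timestamp"))))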

I believe there is an equivalent way in Java, Python and R.
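
Whichever API is used, the resulting month depends on the time zone applied when the epoch seconds are converted. The sketch below uses only java.time (no Spark) to cross-check what month to expect for the sample values in the question; the object and method names are hypothetical, and the two zones are chosen purely for illustration.

```scala
import java.time.{Instant, ZoneId}

object EpochMonthCheck {
  // Month of year (1-12) for an epoch-seconds value, interpreted in the given zone.
  def monthOf(epochSeconds: Long, zone: ZoneId): Int =
    Instant.ofEpochSecond(epochSeconds).atZone(zone).getMonthValue

  def main(args: Array[String]): Unit = {
    val utc = ZoneId.of("UTC")
    val losAngeles = ZoneId.of("America/Los_Angeles")

    // 231312231 is 1977-05-01 05:23:51 UTC, which is still 30 April on the US west coast.
    println(monthOf(231312231L, utc))        // 5
    println(monthOf(231312231L, losAngeles)) // 4

    // 192312311 is 1976-02-04 20:05:11 UTC.
    println(monthOf(192312311L, utc))        // 2
  }
}
```

If the Spark result differs from the UTC value here, the time zone Spark uses for the conversion is the first thing to check.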

Haroun Mohammedi answered Oct 27 '22 00:10