I would like to change the following dataframe:
--id--rating--timestamp--
-------------------------
| 0 | 5.0 | 231312231 |
| 1 | 3.0 | 192312311 | # epoch time (seconds since Thursday, 1 January 1970)
-------------------------
to the following dataframe:
--id--rating--timestamp--
--------------------------
| 0 | 5.0 | 05 |
| 1 | 3.0 | 04 | #Month of year
--------------------------
How can I do that?
The Spark DataFrame withColumn() function is used to update the value of a column. withColumn() takes two arguments: the name of the column you want to update and the Column expression you want to update it with. If the specified column name is not found, it creates a new column with that value.
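For example, a minimal sketch, assuming a DataFrame named df with the question's id/rating/timestamp columns:
import org.apache.spark.sql.functions.{col, lit}
// updating an existing column returns a new DataFrame; the original is untouched
val boosted = df.withColumn("rating", col("rating") + lit(1.0))
// referencing a column name that does not exist creates a new column instead
val flagged = df.withColumn("high_rating", col("rating") >= lit(4.0))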
You can replace column values of a PySpark DataFrame by using the SQL string functions regexp_replace(), translate(), and overlay().
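The same functions exist in the Scala API; a small sketch, assuming a DataFrame people with a hypothetical string column address (overlay() requires Spark 3.0+):
import org.apache.spark.sql.functions.{col, lit, overlay, regexp_replace, translate}
// regexp_replace: regex-based replacement; translate: per-character substitution; overlay: overwrite starting at a position
val cleaned = people
  .withColumn("address", regexp_replace(col("address"), "Rd$", "Road"))
  .withColumn("address", translate(col("address"), "#-", "._"))
  .withColumn("address", overlay(col("address"), lit("**"), lit(1)))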
You can update a PySpark DataFrame column using withColumn(), select(), and sql(). Since DataFrames are distributed, immutable collections, you can't really change the column values in place; whichever approach you use, PySpark returns a new DataFrame with the updated values.
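A sketch of the select() and sql() routes for the question's conversion, assuming the DataFrame is called df and spark is the active SparkSession:
import org.apache.spark.sql.functions.{col, from_unixtime, month}
// both calls return a new DataFrame with the converted column
val viaSelect = df.select(col("id"), col("rating"), month(from_unixtime(col("timestamp"))).as("timestamp"))
df.createOrReplaceTempView("ratings")
val viaSql = spark.sql("SELECT id, rating, month(from_unixtime(timestamp)) AS timestamp FROM ratings")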
Using Spark withColumnRenamed to rename a DataFrame column: Spark has a withColumnRenamed() function on DataFrame to change a column name. This is the most straightforward approach; the function takes two parameters: the first is your existing column name and the second is the new column name you wish to use.
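For instance, a quick sketch, again assuming the question's DataFrame df:
// only the column header changes; the data stays the same
val renamed = df.withColumnRenamed("timestamp", "epoch_seconds")
renamed.printSchema()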
It's easy using built-in functions:
import org.apache.spark.sql.functions._
import spark.implicits._
// from_unixtime converts the epoch seconds to a timestamp string; month then extracts the month of year (1-12)
val newDF = dataset.withColumn("timestamp", month(from_unixtime('timestamp)))
Note that DataFrames are immutable, so you can create a new DataFrame but not modify the existing one. Of course, you can assign the result back to the same variable, as sketched below.
Note number 2: DataFrame = Dataset[Row], which is why I use both names.
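For instance, if the reference is declared as a var, the new DataFrame can be assigned back to the same name (a sketch with a hypothetical source path; the underlying DataFrames stay immutable):
import org.apache.spark.sql.functions.{from_unixtime, month}
var ratings = spark.read.parquet("/path/to/ratings")  // hypothetical source
// month(...) builds a new DataFrame; the var simply points at it afterwards
ratings = ratings.withColumn("timestamp", month(from_unixtime(ratings("timestamp"))))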
If you're coming from Scala, you can use the sql.functions methods inside DataFrame.select or DataFrame.withColumn. For your case, I think the method month(e: Column): Column can perform the change you want. It will be something like this:
import org.apache.spark.sql.functions.{col, from_unixtime, month}
// convert the epoch seconds to a timestamp first, then extract the month
df.withColumn("timestamp", month(from_unixtime(col("timestamp"))))
I do believe that there's an equivalent way in Java, Python, and R.