I have the following DataFrame:
+----------+
| date|
+----------+
|2017-01-25|
|2017-01-21|
|2017-01-12|
+----------+
Here is the code the create above DataFrame:
import pyspark.sql.functions as f
rdd = sc.parallelize([("2017/11/25",), ("2017/12/21",), ("2017/09/12",)])
df = sqlContext.createDataFrame(rdd, ["date"]).withColumn("date", f.to_date(f.col("date"), "yyyy/MM/dd"))
df.show()
I want a new column with the first date of month for each row, just replace the day to "01" in all the dates
+----------++----------+
| date| first_date|
+----------++----------+
|2017-11-25| 2017-11-01|
|2017-12-21| 2017-12-01|
|2017-09-12| 2017-09-01|
+----------+-----------+
There is a last_day function in PySpark.sql.function, however, there is no first_day function.
I tried using date_sub to do this but did not work: I get a column not Iterable error because the second argument to date_sub cannot be a column and has to be an integer.
f.date_sub(f.col('date'), f.dayofmonth(f.col('date')) - 1 )
In PySpark use date_format() function to convert the DataFrame column from Date to String format.
Spark Get Day of the Year from Date column As mentioned earlier we use Date formatting patterns to get the Day of the Year from Date or Timestamp type columns. D – date formatting pattern is used to extract the day of the year. for example date '2019-02-01' return 32.
Returns the last day of the month which the given date belongs to. New in version 1.5.
You can use trunc
:
import pyspark.sql.functions as f
df.withColumn("first_date", f.trunc("date", "month")).show()
+----------+----------+
| date|first_date|
+----------+----------+
|2017-11-25|2017-11-01|
|2017-12-21|2017-12-01|
|2017-09-12|2017-09-01|
+----------+----------+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With