I have a Spark DataFrame as shown below:
# Create DataFrame
df <- data.frame(name = c("Thomas", "William", "Bill", "John"),
                 dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08'))
df <- createDataFrame(df)
# Make sure the df$dates column is of 'date' type
df <- withColumn(df, 'dates', cast(df$dates, 'date'))
name | dates
--------------------
Thomas |2017-01-05
William |2017-02-23
Bill |2017-03-16
John |2017-04-08
I want to change each date to the last day of its month, so the result would look like the table below. How do I do this? Either SparkR or PySpark code is fine.
name | dates
--------------------
Thomas |2017-01-31
William |2017-02-28
Bill |2017-03-31
John |2017-04-30
Spark has a last_day() function that returns the last day of the month a given date falls in, which is exactly what you need here. (If you ever need the last day of the *previous* month instead, combine it with add_months(): apply add_months() with -1 to step back one month, then apply last_day().)
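The previous-month combination can be illustrated in plain Python (this is just a sketch of the semantics using the standard library, not Spark code; in a real job you would use Spark's own last_day() and add_months()):

```python
import calendar
from datetime import date

def last_day_of_previous_month(d: date) -> date:
    """Mimics last_day(add_months(d, -1)) from Spark SQL:
    step back one month, then take that month's final day."""
    # Move to the previous month, wrapping January back to December.
    year, month = (d.year - 1, 12) if d.month == 1 else (d.year, d.month - 1)
    # monthrange() returns (first weekday, number of days in the month).
    return date(year, month, calendar.monthrange(year, month)[1])

print(last_day_of_previous_month(date(2017, 3, 16)))  # 2017-02-28
```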
You may use the following (PySpark):
from pyspark.sql.functions import last_day
df.select('name', last_day(df.dates).alias('dates')).show()
To clarify, last_day(date) returns the last day of the month to which date belongs.
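To make that behavior concrete, here is the same logic in plain Python (an illustrative sketch using the standard library, not Spark code):

```python
import calendar
from datetime import date

def end_of_month(d: date) -> date:
    """Equivalent of Spark's last_day(): replace the day with the
    number of days in that month (monthrange handles leap years)."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])

for s in ["2017-01-05", "2017-02-23", "2017-03-16", "2017-04-08"]:
    print(end_of_month(date.fromisoformat(s)))
# 2017-01-31
# 2017-02-28
# 2017-03-31
# 2017-04-30
```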
There is an equivalent function in SparkR as well: https://spark.apache.org/docs/1.6.2/api/R/last_day.html