I am trying to select all records recorded on Sunday using Spark SQL. I have tried the following, but in vain.
SELECT * FROM mytable WHERE DATEPART(WEEKDAY, create_time) = 0
SELECT * FROM mytable WHERE strftime("%w", create_time) = 0
How to get day of week in SparkSQL?
We can easily get the day of week using date_format(). PySpark also provides dayofmonth(), dayofyear() and dayofweek() functions, which extract the day of month, day of year and day of week from a date column, respectively.
Spark 1.5.0 has a date_format
function that accepts a format as an argument. The 'EEEE' format returns the name of the weekday for a timestamp:
select date_format(my_timestamp, 'EEEE') from ....
Result: e.g. 'Tuesday'
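To answer the original question directly, the same pattern works in a WHERE clause. A minimal sketch, assuming mytable and create_time from the question and an active SparkSession called spark:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
// keep only rows whose create_time falls on a Sunday
spark.sql("SELECT * FROM mytable WHERE date_format(create_time, 'EEEE') = 'Sunday'").show()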
If create_time is stored as a Unix epoch timestamp (UTC seconds), you can use the following to filter out specific days in Spark SQL. I used Spark 1.6.1:
select id, date_format(from_unixtime(created_utc), 'EEEE') from testTable where date_format(from_unixtime(created_utc), 'EEEE') == "Wednesday"
If you specify 'EEEE', the day of the week is spelled out completely. You can use 'E' for the shortened version, e.g. Wed. You can find more info here: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame and http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
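For instance, a minimal sketch of the same query with the short pattern (testTable, created_utc and an active SparkSession called spark are the same assumptions as above):
// dow comes back as e.g. "Wed" instead of "Wednesday"
spark.sql("select id, date_format(from_unixtime(created_utc), 'E') as dow from testTable").show()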
Here's an example that shows the output from the different options. Suppose you have this DataFrame.
+----------+
| some_date|
+----------+
|2021-01-10|
|2021-01-11|
|2021-01-12|
|2021-01-13|
|2021-01-14|
|2021-01-15|
|2021-01-16|
| null|
+----------+
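For reference, a minimal sketch of how such a DataFrame could be built (the column name some_date matches the table above; an active SparkSession called spark and its implicits are assumed):
import org.apache.spark.sql.functions._
import spark.implicits._
// single date column, including a null row
val df = Seq("2021-01-10", "2021-01-11", "2021-01-12", "2021-01-13",
  "2021-01-14", "2021-01-15", "2021-01-16", null)
  .toDF("some_date")
  .withColumn("some_date", to_date(col("some_date")))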
Here are the results of using the dayofweek, date_format, and dayOfWeekStr (from spark-daria) functions.
import org.apache.spark.sql.functions._
import com.github.mrpowers.spark.daria.sql.functions._
df
.withColumn("dayofweek", dayofweek(col("some_date")))
.withColumn("date_format", date_format(col("some_date"), "EEEE"))
.withColumn("dayOfWeekStr", dayOfWeekStr(col("dayofweek")))
.show()
+----------+---------+-----------+------------+
| some_date|dayofweek|date_format|dayOfWeekStr|
+----------+---------+-----------+------------+
|2021-01-10| 1| Sunday| Sun|
|2021-01-11| 2| Monday| Mon|
|2021-01-12| 3| Tuesday| Tue|
|2021-01-13| 4| Wednesday| Wed|
|2021-01-14| 5| Thursday| Thu|
|2021-01-15| 6| Friday| Fri|
|2021-01-16| 7| Saturday| Sat|
| null| null| null| null|
+----------+---------+-----------+------------+
The dayofweek output is best for date arithmetic with date_add or date_sub, as described in this post.
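For example, a minimal sketch that uses dayofweek to roll every date back to the Sunday that starts its week (continuing with the df from above; expr is used so the column-valued day count works across Spark versions):
import org.apache.spark.sql.functions._
// dayofweek returns 1 for Sunday, so subtracting (dayofweek - 1) days lands on that week's Sunday
df.withColumn("week_start", expr("date_sub(some_date, dayofweek(some_date) - 1)"))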
The date_format solution is best for customizing dates for a given format (e.g. for a human-readable deliverable).
Some date functions, like next_day, take a day in string form as an argument, and that's when dayOfWeekStr comes in handy.
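A minimal sketch with the built-in next_day, which expects a day name such as "Sun" (again using the df from above):
import org.apache.spark.sql.functions._
// first Sunday strictly after each some_date
df.withColumn("next_sunday", next_day(col("some_date"), "Sun")).show()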