weekofyear() returning seemingly incorrect results for January 1

I'm not quite sure why my code gives 52 as the answer for weekofyear("01/JAN/2017").

Does anyone have a possible explanation for this? Is there a better way to do this?

from pyspark.sql import SparkSession, functions
from pyspark.sql.functions import to_date

spark = SparkSession.builder.appName('weekOfYear').getOrCreate()

df = spark.createDataFrame(
    [(1, "01/JAN/2017"), (2, "15/FEB/2017")], ("id", "date")) 

df.show()
+---+-----------+
| id|       date|
+---+-----------+
|  1|01/JAN/2017|
|  2|15/FEB/2017|
+---+-----------+

Calculate the week of the year:

df = df.withColumn("weekofyear", functions.weekofyear(to_date(df["date"], "dd/MMM/yyyy")))

df.printSchema()

root
 |-- id: long (nullable = true)
 |-- date: string (nullable = true)
 |-- weekofyear: integer (nullable = true)

df.show()

The 'error' is visible below:

+---+-----------+----------+
| id|       date|weekofyear|
+---+-----------+----------+
|  1|01/JAN/2017|        52|
|  2|15/FEB/2017|         7|
+---+-----------+----------+
asked Apr 18 '18 by Grant Shannon

1 Answer

It seems that weekofyear() returns 1 for January 1st only if the day of the week is Monday through Thursday.

To confirm, I created a DataFrame with "01/JAN/YYYY" for every year from 1900 to 2018:

df = spark.createDataFrame(
    [(1, "01/JAN/{y}".format(y=year)) for year in range(1900, 2019)],
    ["id", "date"]
)

Now let's convert it to a date, get the day of the week, and count the values for weekofyear():

import pyspark.sql.functions as f
df.withColumn("d", f.to_date(f.from_unixtime(f.unix_timestamp('date', "dd/MMM/yyyy"))))\
    .withColumn("weekofyear", f.weekofyear("d"))\
    .withColumn("dayofweek", f.date_format("d", "E"))\
    .groupBy("dayofweek", "weekofyear")\
    .count()\
    .show()
#+---------+----------+-----+
#|dayofweek|weekofyear|count|
#+---------+----------+-----+
#|      Sun|        52|   17|
#|      Mon|         1|   18|
#|      Tue|         1|   17|
#|      Wed|         1|   17|
#|      Thu|         1|   17|
#|      Fri|        53|   17|
#|      Sat|        53|    4|
#|      Sat|        52|   12|
#+---------+----------+-----+

Note: I am using Spark v2.1, where to_date() does not accept a format argument, so I had to use the unix_timestamp()/from_unixtime() workaround shown above to convert the string to a date.
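
If you're on Spark 2.2 or later, to_date() does accept a format string (as in the question's code), so the same check can be written without the unix_timestamp round-trip:

# Spark 2.2+: to_date() takes the format directly.
df.withColumn("d", f.to_date("date", "dd/MMM/yyyy"))\
    .withColumn("weekofyear", f.weekofyear("d"))\
    .withColumn("dayofweek", f.date_format("d", "E"))\
    .groupBy("dayofweek", "weekofyear")\
    .count()\
    .show()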

Similarly, weekofyear() only returns 1 for the following (cross-checked in the sketch after this list):

  • January 2nd, if the day of the week is Monday through Friday.
  • January 3rd, if the day of the week is Monday through Saturday.
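
As a quick cross-check outside Spark, Python's standard library uses the same ISO week numbering, so the pattern for January 1st through 3rd can be reproduced with datetime.isocalendar() (a minimal sketch over the same 1900-2018 range; strftime("%a") assumes an English locale):

from datetime import date

# For each of Jan 1, 2, and 3: which weekday names put that date in ISO week 1?
for day in (1, 2, 3):
    weekdays = sorted(
        {date(year, 1, day).strftime("%a")
         for year in range(1900, 2019)
         if date(year, 1, day).isocalendar()[1] == 1}
    )
    print("Jan {d} is in week 1 when it falls on: {w}".format(d=day, w=weekdays))
# Jan 1 is in week 1 when it falls on: ['Mon', 'Thu', 'Tue', 'Wed']
# Jan 2 is in week 1 when it falls on: ['Fri', 'Mon', 'Thu', 'Tue', 'Wed']
# Jan 3 is in week 1 when it falls on: ['Fri', 'Mon', 'Sat', 'Thu', 'Tue', 'Wed']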

Update

This behavior is consistent with the ISO 8601 week date definition: week 1 is the week that contains the year's first Thursday, so January 1st belongs to week 1 only when it falls on Monday through Thursday. January 1st, 2017 was a Sunday, so it belongs to the last week (week 52) of ISO year 2016.
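
If what you actually want is a "simple" week number where January 1st always starts week 1 (days 1-7 are week 1, days 8-14 are week 2, and so on), one workaround is to derive it from dayofyear() instead. This is a sketch, not ISO 8601, and assumes d is the parsed date column from above:

# Not ISO: days 1-7 -> week 1, days 8-14 -> week 2, ...
df.withColumn("simple_week", ((f.dayofyear("d") - 1) / 7).cast("int") + 1)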

answered Nov 04 '22 by pault