
Pandas datetime week not as expected

When working with Pandas datetimes, I'm trying to group data by the week and year. However, I have noticed some years where the last day of the year ends up grouped with the first week of the same year.

import pandas as pd
day_df = pd.DataFrame(index=pd.date_range('2016-01-01', '2020-12-31'))

for (week, year), subset in day_df.groupby([day_df.index.week, day_df.index.year]):
    if week == 1:
        print('Week:', week, subset.index.min(), subset.index.max())

Week: 1 2016-01-04 00:00:00 2016-01-10 00:00:00
Week: 1 2017-01-02 00:00:00 2017-01-08 00:00:00
Week: 1 2018-01-01 00:00:00 2018-12-31 00:00:00
Week: 1 2019-01-01 00:00:00 2019-12-31 00:00:00
Week: 1 2020-01-01 00:00:00 2020-01-05 00:00:00

For 2018 and 2019, the first day of the year ends up grouped with the final day of the year! Is this behavior expected? Why would the final day of the year be week 1?

I've gotten the result I want with a basic if statement, but this week behavior seems like it could lead to issues because it's unexpected.

This does what I intended with the grouping:

for (week, year), subset in day_df.groupby([day_df.index.week, day_df.index.year]):
    # Prevent first week of year from including final days of same year
    if set(subset.index.month.unique()) == set([1, 12]):
        subset = subset.loc[subset.index.month == 1]
    if week == 1:
        print('Week:', week, subset.index.min(), subset.index.max())

Week: 1 2016-01-04 00:00:00 2016-01-10 00:00:00
Week: 1 2017-01-02 00:00:00 2017-01-08 00:00:00
Week: 1 2018-01-01 00:00:00 2018-01-07 00:00:00
Week: 1 2019-01-01 00:00:00 2019-01-06 00:00:00
Week: 1 2020-01-01 00:00:00 2020-01-05 00:00:00
willk asked Apr 28 '19 11:04

1 Answer

The answer lies in the fact that .week is a week ordinal. It is an attribute, not a method, and the docs define it only minimally:

DatetimeIndex.week

The week ordinal of the year

The week ordinal follows the ISO week date system: weeks start on Monday, and week 1 is the week that contains the first Thursday of the year. Further notes about it in Python can be found under date.isocalendar() in the Python 3.7.3 datetime docs. For a general explanation of how the week ordinal works, see the Wikipedia article on ISO week date.

The week ordinals for 2019 can be found on EpochConverter.com, which clearly shows week 1 of 2019 starting on December 31, 2018.

So Dec 31, 2018 is the Monday that starts week 1 of 2019. Its week ordinal is 1 while its calendar year is still 2018, so grouping by week and year puts it in the same group as the first week of January 2018. It therefore correctly meets the criteria of your filter for the beginning of the year.
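The same ISO calendar values can be checked with the standard library alone. This is a minimal sketch; the helper name iso_parts is hypothetical and only wraps date.isocalendar():

```python
from datetime import date

def iso_parts(d):
    """Return (ISO year, ISO week, ISO weekday) for a date.

    Hypothetical helper around date.isocalendar(); ISO weekdays run
    from 1 (Monday) to 7 (Sunday).
    """
    iso_year, iso_week, iso_weekday = d.isocalendar()
    return iso_year, iso_week, iso_weekday

for d in (date(2018, 12, 30), date(2018, 12, 31), date(2019, 1, 1)):
    print(d, iso_parts(d))
```

Note that the ISO year returned for 2018-12-31 is 2019, not 2018, which is why pairing the week ordinal with the calendar year mixes dates from both ends of the year.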

Below we filter the end of 2018 and the beginning of 2019 to see what .week is doing.

day_df["ordinal"] = day_df.index.week
day_df["day_of_week"] = day_df.index.weekday
print(day_df.loc["2018-12-28":"2019-01-08"])



             ordinal  day_of_week
2018-12-28       52            4
2018-12-29       52            5
2018-12-30       52            6
2018-12-31        1            0
2019-01-01        1            1
2019-01-02        1            2
2019-01-03        1            3
2019-01-04        1            4
2019-01-05        1            5
2019-01-06        1            6
2019-01-07        2            0
2019-01-08        2            1

You will need to add a month criterion to ensure the dates fall in January, as you discovered in your question above. Adding the month to the grouping works too:

for (week, month, year), subset in day_df.groupby(
    [day_df.index.week, day_df.index.month, day_df.index.year]
):
    if week == 1 and month == 1:
        print("Week:", subset.index.min(), subset.index.max())
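If whole ISO weeks are acceptable (so that week 1 may start in late December), another option is to pair the ISO week with the ISO year instead of the calendar year. The sketch below assumes pandas 1.1 or later, where DatetimeIndex.isocalendar() is available; the .week attribute used above was deprecated in that release and removed in pandas 2.0. The "value" column and the first_weeks dictionary are placeholders added for illustration.

```python
import pandas as pd

dates = pd.date_range("2016-01-01", "2020-12-31")
# Placeholder column so the frame is not empty; any data would do.
day_df = pd.DataFrame({"value": 0}, index=dates)

# isocalendar() returns a DataFrame with "year", "week" and "day"
# columns, indexed by the original dates.
iso = day_df.index.isocalendar()

first_weeks = {}
for (iso_year, iso_week), subset in day_df.groupby([iso["year"], iso["week"]]):
    if iso_week == 1:
        first_weeks[int(iso_year)] = (subset.index.min(), subset.index.max())
        print("ISO year:", iso_year, "week:", iso_week,
              subset.index.min().date(), subset.index.max().date())
```

With this grouping, week 1 of 2019 runs from 2018-12-31 to 2019-01-06, matching the table above, and no group mixes January with the following December.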

If you wish to have week numbers that restart with each calendar year, use strftime() (for example pandas.Period.strftime()) with the %U directive.

%U is defined as

Week number of the year (Sunday as the first day of the week) as a decimal number [00,53]. All days in a new year preceding the first Sunday are considered to be in week 0.

For your dataframe, this would look like:

day_df["date"] = day_df.index
day_df["day_name"] = day_df["date"].dt.day_name()
# %U: week of the year with Sunday as the first day of the week
day_df["str_from_time"] = day_df["date"].dt.strftime("%U")
print(day_df.loc["2018-12-28":"2019-01-08", ["ordinal", "str_from_time", "day_of_week", "day_name"]])

            ordinal str_from_time  day_of_week   day_name
2018-12-28       52            51            4     Friday
2018-12-29       52            51            5   Saturday
2018-12-30       52            52            6     Sunday
2018-12-31        1            52            0     Monday
2019-01-01        1            00            1    Tuesday
2019-01-02        1            00            2  Wednesday
2019-01-03        1            00            3   Thursday
2019-01-04        1            00            4     Friday
2019-01-05        1            00            5   Saturday
2019-01-06        1            01            6     Sunday
2019-01-07        2            01            0     Monday
2019-01-08        2            01            1    Tuesday
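To group on these calendar-year week numbers, one option is to build an explicit (year, week) key. The following is a minimal sketch using only the standard library; the helper name calendar_week_key and its directive parameter are hypothetical, not part of pandas.

```python
from datetime import date

def calendar_week_key(d, directive="%U"):
    """Return (calendar year, week number) for a date.

    Hypothetical helper. With "%U" weeks start on Sunday; "%W" is the
    Monday-based equivalent. Days before the first such weekday of the
    year fall in week 0.
    """
    return d.year, int(d.strftime(directive))

for d in (date(2018, 12, 31), date(2019, 1, 1), date(2019, 1, 6)):
    print(d, calendar_week_key(d))
```

Because the year in the key is the calendar year and the week number restarts every January, the last days of December can never land in the same group as the first days of a year.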
run-out answered Sep 25 '22 11:09